GCP Dataproc and Apache Spark tuning
Dataproc is a fully managed and highly scalable Google Cloud Platform service for running Apache Spark. However, “managed” does not relieve you of the prop...
A long-running Spark Streaming job, once submitted to the YARN cluster, should run forever until it is intentionally stopped. Any interruption introduces subs...
When I tried to deploy my first Spark application on a YARN cluster, I realized that there were no clear and concise instructions on how to prepare the applicati...
In the world beyond batch, streaming data processing is the future of big data. Regardless of the streaming framework used for data processing, tight integrati...
I published a post on the allegro.tech blog about how to integrate Spark Streaming and Kafka. In the blog post you will find how to avoid java.io.NotSerializableExc...
When you develop a distributed system, it is crucial to make it easy to test. Execute tests in a controlled environment, ideally from your IDE. Long develop-...