Recent Posts

Kafka Streams DSL vs Processor API

29 minute read

Kafka Streams is a Java library for building real-time, highly scalable, fault-tolerant, distributed applications. The library is fully integrated with Kafka and leverages Kafka producer and consumer semantics (e.g. partitioning, rebalancing, data retention and compaction). What’s unique, the only dependency to r...

Apache Big Data Europe 2016

10 minute read

Last week I attended Apache Big Data Europe, held in Sevilla, Spain. The event focuses on big data projects under the Apache Software Foundation umbrella. Below you can find my overall impression of the conference and notes from several interesting sessions. The notes are presented as short checklists, if some aspect w...

Long-running Spark Streaming jobs on YARN cluster

16 minute read

A long-running Spark Streaming job, once submitted to the YARN cluster, should run forever until it’s intentionally stopped. Any interruption introduces substantial processing delays and could lead to data loss or duplicates. Neither YARN nor Apache Spark was designed for executing long-running services. But th...

Spark application assembly for cluster deployments

12 minute read

When I tried to deploy my first Spark application on a YARN cluster, I realized that there were no clear and concise instructions on how to prepare the application for deployment. This blog post can be treated as the missing manual on how to build a Spark application written in Scala into a deployable binary.

Spark and Kafka integration patterns – part 2

21 minute read

In the world beyond batch, streaming data processing is the future of big data. Regardless of the streaming framework used for data processing, tight integration with a replayable data source like Apache Kafka is often required. Streaming applications often use Apache Kafka as a data source, or as a destination for ...

Spark and Kafka integration patterns – part 1

less than 1 minute read

I published a post on the allegro.tech blog about how to integrate Spark Streaming and Kafka. In the post you will find how to avoid java.io.NotSerializableException when a Kafka producer is used to publish the results of Spark Streaming processing.

Spark and Spark Streaming unit testing

11 minute read

When you develop a distributed system, it’s crucial to make it easy to test. Execute tests in a controlled environment, ideally from your IDE. A long develop-test-develop cycle for complex systems can kill your productivity. Below you will find my testing strategy for Spark and Spark Streaming applications.

Programming language doesn’t matter

1 minute read

A few days ago I attended a quick presentation of a significant e-commerce platform. The custom platform is implemented mostly in PHP and designed as a scalable, distributed system. And I was really impressed! Below you can find a short summary of the chosen libraries, frameworks and tools.