Passionate Developer

Software Engineering at Google – book review

September 22, 2022 9 minute read

I recently read “Software Engineering at Google”, curated by Titus Winters, Tom Manshreck and Hyrum Wright.

Apache Beam SQL

September 14, 2022 36 minute read

If you are a big data engineer who develops batch data pipelines, you have probably heard that stream processing is the future. It unlocks the full potential of data that’s often unbounded in nature. You don’t need batch pipelines anymore; just implement everything in a streaming fashion. There are plenty of modern and ea...

Apache Beam Summit 2022

July 7, 2022 15 minute read

Last week I virtually attended Apache Beam Summit 2022, held in Austin, Texas. The event centers on Apache Beam and its runners, such as Dataflow, Flink, Spark or Samza. Below you can find my overall impression of the conference and notes from several interesting sessions. If some aspect was particularly appeal...

GCP troubleshooting with API monitoring

July 7, 2022 7 minute read

When you deploy non-trivial data pipelines on complex infrastructure, you should expect trouble sooner or later.

FinOps for data pipelines on Google Cloud Platform

April 2, 2022 25 minute read

Do you check the costs of your data pipelines in exactly the same way as you check overall health, latency or throughput? Nowadays, taking care of cost efficiency is an integral part of every data engineer’s job. I would like to share my own experiences with applying the FinOps discipline in the organization within tens of da...

GCP Dataproc and Apache Spark tuning

March 24, 2022 8 minute read

Dataproc is a fully managed and highly scalable Google Cloud Platform service for running Apache Spark. However, “managed” doesn’t relieve you of the need for proper configuration if you want to squeeze out more processing power for less money. Today you will learn the easiest method to configure Dataproc and Spark together to get optimal...

GCP Cloud Composer and Apache Airflow tuning

March 15, 2022 15 minute read

I would love to develop only streaming pipelines, but in reality some pipelines are still batch-oriented. Today you will learn how to properly configure Cloud Composer, the Google Cloud Platform scheduler.

Stream processing – part 2

March 8, 2022 23 minute read

This is the second part of the stream processing blog post series. In the first part I presented aggregations in fixed, non-overlapping windows. Now you will learn dynamic aggregations in data-driven windows, for which the size of each window depends on the input data instead of a predefined time-based pattern. At...

Stream processing – part 1

January 28, 2022 17 minute read

This is the first part of the stream processing blog post series. In this series you will learn how to develop and test stateful streaming data pipelines.

Kafka Streams DSL vs processor API

November 2, 2017 29 minute read

Kafka Streams is a Java library for building real-time, highly scalable, fault-tolerant, distributed applications. The library is fully integrated with Kafka and leverages Kafka producer and consumer semantics (e.g. partitioning, rebalancing, data retention and compaction). What’s unique, the only dependency to r...

Marcin Kuthan
