Posts by Year

2023

Test Driven Development for Data Engineers

31 minute read

Test Driven Development (TDD) is a well-established practice in the software engineering community. It helps to guarantee that code is reliable and error-free by requiring developers to write tests before writing the actual code, and promotes better code design and modularity for easier maintenance and extension. Para...

Unified batch and streaming

18 minute read

Unified batch and streaming processing is a data architecture that seamlessly combines both batch and real-time data processing. It enables organizations to gain real-time insights from their data while maintaining the ability to process large volumes of historical data. In the past organizations often dealt with ba...

Back to a senior engineer

7 minute read

I started my professional software engineering career around the year 2000. For the next 14 years I was in an individual contributor role. When I joined Allegro in 2014 I had a choice: be a senior software engineer or become a team leader. I wanted to try something new and I started working as a team leader. After 9 ye...

ChatGPT – new way of learning

24 minute read

This evening I decided to learn something about type classes. In the past I would’ve picked a book with good reviews, or read a set of articles written by functional programming masters. This time, I wanted to check if I could talk with machines 😜

Managing technical debt using Dependabot

11 minute read

Today I would like to show you how to manage technical debt by updating project dependencies continuously. Surprisingly, with this technique you will get many more goodies than just updated dependencies:

2022

Technical writing techniques and tools

17 minute read

I’ve managed this blog since 2011. I’m also a primary contributor to the documentation managed by my team. For example: on-boarding guide for new colleagues or a tourist, technical guides, definition of done, architecture decision records and more. As you perhaps already know, software engineers don’t like writing t...

The Go programming language

3 minute read

Go is a modern programming language designed at Google to improve developer productivity. But how do you improve productivity? By limiting engineers’ power and flexibility 😂

Apache Beam SQL

36 minute read

If you are a BigData engineer who develops batch data pipelines, you might often hear that stream processing is the future. It unlocks the full potential of data that’s often unbounded in nature. You don’t need batch pipelines anymore, implement everything in a streaming fashion. There are plenty of modern and ea...

Apache Beam Summit 2022

15 minute read

Last week I virtually attended Apache Beam Summit 2022 held in Austin, Texas. The event focuses on Apache Beam and runners like Dataflow, Flink, Spark or Samza. Below you can find my overall impression of the conference and notes from several interesting sessions. If some aspect was particularly appeal...

FinOps for data pipelines on Google Cloud Platform

25 minute read

Do you check the costs of your data pipelines in exactly the same way as you check overall health, latency or throughput? Nowadays, taking care of cost efficiency is an integral part of every data engineer’s job. I would like to share my own experiences with applying FinOps discipline in the organization within tens of da...

GCP Dataproc and Apache Spark tuning

8 minute read

Dataproc is a fully managed and highly scalable Google Cloud Platform service for running Apache Spark. However, “managed” doesn’t relieve you of proper configuration to squeeze more processing power for less money. Today you will learn the easiest method to configure Dataproc and Spark together to get optimal...

GCP Cloud Composer and Apache Airflow tuning

15 minute read

I would love to only develop streaming pipelines but in reality some of them are still batch oriented. Today you will learn how to properly configure Google Cloud Platform scheduler – Cloud Composer.

Stream processing – part 2

23 minute read

This is the second part of the stream processing blog post series. In the first part I presented aggregations in fixed, non-overlapping windows. Now you will learn dynamic aggregations in data-driven windows, for which the size of each window depends on the input data instead of a predefined time-based pattern. At...

Stream processing – part 1

17 minute read

This is the first part of the stream processing blog post series. From the series you will learn how to develop and test stateful streaming data pipelines.

2017

Kafka Streams DSL vs processor API

29 minute read

Kafka Streams is a Java library for building real-time, highly scalable, fault-tolerant, distributed applications. The library is fully integrated with Kafka and leverages Kafka producer and consumer semantics (e.g. partitioning, rebalancing, data retention and compaction). What’s unique, the only dependency to r...

2016

Apache BigData Europe 2016

10 minute read

Last week I attended Apache Big Data Europe held in Sevilla, Spain. The event focuses on big data projects under the Apache Foundation umbrella. Below you can find my overall impression of the conference and notes from several interesting sessions. The notes are presented as short checklists, if some aspect w...

Long-running Spark Streaming jobs on YARN cluster

16 minute read

A long-running Spark Streaming job, once submitted to the YARN cluster, should run forever until it’s intentionally stopped. Any interruption introduces substantial processing delays and could lead to data loss or duplicates. Neither YARN nor Apache Spark has been designed for executing long-running services. But th...

Spark application assembly for cluster deployments

12 minute read

When I tried to deploy my first Spark application on a YARN cluster, I realized that there were no clear and concise instructions on how to prepare the application for deployment. This blog post could be treated as the missing manual on how to build a Spark application written in Scala into a deployable binary.

Spark and Kafka integration patterns – part 2

21 minute read

In the world beyond batch, streaming data processing is the future of big data. Regardless of the streaming framework used for data processing, tight integration with a replayable data source like Apache Kafka is often required. The streaming applications often use Apache Kafka as a data source, or as a destination for ...

2015

Spark and Kafka integration patterns – part 1

less than 1 minute read

I published a post on the allegro.tech blog about how to integrate Spark Streaming and Kafka. In the blog post you will find how to avoid the java.io.NotSerializableException when a Kafka producer is used for publishing the results of Spark Streaming processing.

Spark and Spark Streaming unit testing

11 minute read

When you develop a distributed system, it’s crucial to make it easy to test. Execute tests in a controlled environment, ideally from your IDE. A long develop-test-develop cycle for complex systems could kill your productivity. Below you will find my testing strategy for Spark and Spark Streaming applications.

2014

Programming language doesn’t matter

1 minute read

A few days ago I participated in a quick presentation of a significant e-commerce platform. The custom platform is implemented mostly in PHP and designed as a scalable and distributed system. And I was really impressed! Below you can find a short summary of chosen libraries, frameworks and tools.

Release It! – book review

7 minute read

Recently I read the excellent book Release It! written by Michael Nygard. The book is 7 years old and I don’t know how I could have missed it until now.

The Twelve-Factor App – part 1

4 minute read

During my studies about “Micro Services” I found a comprehensive (but short) document about the Twelve-Factor App methodology for building software-as-a-service applications. The original paper is published at 12factor.net. Below you can find a short summary of my experiences for the first part of the document. There is a...

2013

How to send email from JEE application

3 minute read

Sending email notifications from an enterprise application is a very common scenario. I know several methods to solve this puzzle; below you can find a short summary. To send an email from the application, at least the SMTP server address must be configured. Because the released application binary (e.g. a WAR file) should be porta...

DDD Architecture Summary

5 minute read

In this blog post you can find my general rules for implementing a system using Domain Driven Design. Don’t use them blindly, but they are a good starting point for DDD practitioners.

Development Environment Setup

3 minute read

This document is a manual on how to configure a flexible development environment for Java, JavaScript, Ruby and Python - my primary set of tools. Even if installing the runtimes with apt-get seems to be a trivial task, there is limited control over the installed version of the runtime. The goal is to configure environment ...

GitFlow step by step

4 minute read

Git Flow is a mainstream process for branch-per-feature development. Git Flow is the best method I’ve found for managing projects developed by small to medium teams. Before you start reading this post you should read two mandatory texts:

How to document your professional experiences

1 minute read

Have you considered what’s important for a prospective employer? What’s the most valuable information source about your professional experience? How to document that you are an expert in software engineering? Below you can find some of my tricks: Write a blog, teaching is the best learning method :-) Write an ...

2012

Virtual Box VDI maintenance

1 minute read

Virtual Disk Image (VDI) is a Virtual Box container format for guest hard disks. I found that VDI files on the host system grow over time. If your VDI file on the host system is much bigger than the used space on the guest partition, it is time for compaction:

Code coverage for managers and developers

2 minute read

From time to time, people ask me what code coverage by tests should be. Does 60% mean that the project is healthy? Or maybe the goal should be 70% or 80%?

How to convince your manager to adopt Git

4 minute read

Distributed Concurrent Versions Systems (DCVSs) like Git or Mercurial have changed software delivery processes significantly. I would not want to go back to the Middle Ages of Subversion, and I’m able to convince almost any developer to use DCVS. Convincing managers is a much tougher task. Below I collected some insigh...

Pure JEE or Spring Framework

5 minute read

During my career as a J2EE and JEE software developer I have tried to use pure JEE two or three times. And I decided not to repeat this exercise anymore; it would be a waste of my precious time. Below you can find a short but quite comprehensive summary (based on Ilias Tsagklis):

2011

Artifactory Performance Tuning

4 minute read

A few years ago I participated in Kirk Pepperdine’s Java performance tuning training. It was one of the greatest technical trainings I have ever attended! And also a great opportunity to visit Crete :-) Let’s check what I remember from the training … In this blog post I would like to show Artifactory memory utilization analy...
