Posts by Tag

Software Engineering

Foundations of scalable systems – book review

1 minute read

Foundations of Scalable Systems, written by Ian Gorton, is a book I rated with my highest score of 5 ⭐️. I highly recommend that every software engineer grasp the distributed system principles outlined in this book.

Building evolutionary architectures – book review

1 minute read

September this year is dedicated to reviewing software architecture books. This time, I’ve read Building Evolutionary Architectures written by Neal Ford. I appreciate the effort put into compiling this book. However, I found it to be more of a collection of existing books, articles and talks rather than offering new...

Fundamentals of software architecture – book review

12 minute read

Recently I’ve read Fundamentals of Software Architecture written by Mark Richards and Neal Ford. I found this book valuable, even though my company doesn’t have a formal architect role. At Allegro, the most experienced senior software engineers take on the responsibilities of a software architect in addition to thei...

Test Driven Development for data engineers

31 minute read

Test Driven Development (TDD) is a well-established practice in the software engineering community. It helps to guarantee that code is reliable and error-free by requiring developers to write tests before writing the actual code, and promotes better code design and modularity for easier maintenance and extension. Para...

Back to a senior engineer

7 minute read

I started my professional software engineering career around the year 2000. For the next 14 years I was in an individual contributor role. When I joined Allegro in 2014 I had a choice: be a senior software engineer or become a team leader. I wanted to try something new and I started working as a team leader. After 9 ye...

ChatGPT – new way of learning

24 minute read

This evening I decided to learn something about type classes. In the past I would’ve picked a book with good reviews, or read a set of articles written by functional programming masters. This time, I wanted to check if I could talk with machines 😜

Managing technical debt using Dependabot

11 minute read

Today I would like to show you how to manage technical debt by updating project dependencies continuously. Surprisingly, with this technique you will get many more goodies than just updated dependencies:

Technical writing techniques and tools

17 minute read

I’ve managed this blog since 2011. I’m also a primary contributor to the documentation managed by my team. For example: on-boarding guide for new colleagues or a tourist, technical guides, definition of done, architecture decision records and more. As you perhaps already know, software engineers don’t like writing t...

FinOps for data pipelines on Google Cloud Platform

25 minute read

Do you check the costs of the data pipelines in exactly the same way as you check overall health, latency or throughput? Nowadays, taking care of cost efficiency is an integral part of every data engineer’s job. I would like to share my own experiences with applying FinOps discipline in the organization within tens of da...

Spark and Spark Streaming unit testing

11 minute read

When you develop a distributed system, it’s crucial to make it easy to test. Execute tests in a controlled environment, ideally from your IDE. A long develop-test-develop cycle for complex systems could kill your productivity. Below you will find my testing strategy for Spark and Spark Streaming applications.

Release It! – book review

7 minute read

Recently I read the excellent book Release It! written by Michael Nygard. The book is 7 years old and I don’t know how I could have missed it until now.

How to document your professional experiences

1 minute read

Have you considered what’s important for a prospective employer? What’s the most valuable information source about your professional experience? How do you document that you are an expert in software engineering? Below you can find some of my tricks: Write a blog, teaching is the best learning method :-) Write an ...

Code coverage for managers and developers

2 minute read

From time to time, people ask me what code coverage by tests should be. Does 60% mean that the project is healthy? Or maybe the goal should be 70% or 80%?

How to convince your manager to adopt Git

4 minute read

Distributed Concurrent Versions Systems (DCVSs) like Git or Mercurial have changed software delivery processes significantly. I would not want to go back to the Middle Ages of Subversion, and I’m able to convince almost any developer to use DCVS. Convincing managers is a much tougher task. Below I collected some insigh...

Architecture

Foundations of scalable systems – book review

1 minute read

Foundations of Scalable Systems, written by Ian Gorton, is a book I rated with my highest score of 5 ⭐️. I highly recommend that every software engineer grasp the distributed system principles outlined in this book.

Building evolutionary architectures – book review

1 minute read

September this year is dedicated to reviewing software architecture books. This time, I’ve read Building Evolutionary Architectures written by Neal Ford. I appreciate the effort put into compiling this book. However, I found it to be more of a collection of existing books, articles and talks rather than offering new...

Fundamentals of software architecture – book review

12 minute read

Recently I’ve read Fundamentals of Software Architecture written by Mark Richards and Neal Ford. I found this book valuable, even though my company doesn’t have a formal architect role. At Allegro, the most experienced senior software engineers take on the responsibilities of a software architect in addition to thei...

Programming language doesn’t matter

1 minute read

A few days ago I participated in a quick presentation of a significant e-commerce platform. The custom platform is implemented mostly in PHP and designed as a scalable and distributed system. And I was really impressed! Below you can find a short summary of chosen libraries, frameworks and tools.

Release It! – book review

7 minute read

Recently I read the excellent book Release It! written by Michael Nygard. The book is 7 years old and I don’t know how I could have missed it until now.

The Twelve-Factor App – part 1

4 minute read

During my studies about “Micro Services” I found a comprehensive (but short) document about the Twelve-Factor App methodology for building software-as-a-service applications. The original paper is published at 12factor.net. Below you can find a short summary of my experiences for the first part of the document. There is a...

How to send email from JEE application

3 minute read

Sending email notifications from an enterprise application is a very common scenario. I know several methods to solve this puzzle; below you can find a short summary. To send an email from the application, at least the SMTP server address must be configured. Because the released application binary (e.g. WAR file) should be porta...

DDD Architecture Summary

5 minute read

In this blog post you can find my general rules for implementing a system using Domain Driven Design. Don’t use them blindly, but they are a good starting point for DDD practitioners.

Pure JEE or Spring Framework

5 minute read

During my career as a J2EE and JEE software developer I have tried to use pure JEE two or three times. And I decided not to repeat this exercise any more; it would be a waste of my precious time. Below you can find a short but quite comprehensive summary (based on Ilias Tsagklis):

Scala

Test Driven Development for data engineers

31 minute read

Test Driven Development (TDD) is a well-established practice in the software engineering community. It helps to guarantee that code is reliable and error-free by requiring developers to write tests before writing the actual code, and promotes better code design and modularity for easier maintenance and extension. Para...

Unified batch and streaming

18 minute read

Unified batch and streaming processing is a data architecture that seamlessly combines both batch and real-time data processing. It enables organizations to gain real-time insights from their data while maintaining the ability to process large volumes of historical data. In the past organizations often dealt with ba...

ChatGPT – new way of learning

24 minute read

This evening I decided to learn something about type classes. In the past I would’ve picked a book with good reviews, or read a set of articles written by functional programming masters. This time, I wanted to check if I could talk with machines 😜

Stream processing – part 2

23 minute read

This is the second part of the stream processing blog post series. In the first part I presented aggregations in fixed, non-overlapping windows. Now you will learn dynamic aggregations in data-driven windows, for which the size of each window depends on the input data instead of a predefined time-based pattern. At...

Stream processing – part 1

17 minute read

This is the first part of the stream processing blog post series. From the series you will learn how to develop and test stateful streaming data pipelines.

Kafka Streams DSL vs processor API

29 minute read

Kafka Streams is a Java library for building real-time, highly scalable, fault-tolerant, distributed applications. The library is fully integrated with Kafka and leverages Kafka producer and consumer semantics (e.g. partitioning, rebalancing, data retention and compaction). What’s unique, the only dependency to r...

Spark application assembly for cluster deployments

12 minute read

When I tried to deploy my first Spark application on a YARN cluster, I realized that there were no clear and concise instructions on how to prepare the application for deployment. This blog post could be treated as a missing manual on how to build a Spark application written in Scala to get a deployable binary.

Spark and Kafka integration patterns – part 2

21 minute read

In the world beyond batch, streaming data processing is the future of big data. Regardless of the streaming framework used for data processing, tight integration with a replayable data source like Apache Kafka is often required. The streaming applications often use Apache Kafka as a data source, or as a destination for ...

Spark and Kafka integration patterns – part 1

less than 1 minute read

I published a post on the allegro.tech blog about how to integrate Spark Streaming and Kafka. In the blog post you will find how to avoid the java.io.NotSerializableException when a Kafka producer is used for publishing results of the Spark Streaming processing.

Spark and Spark Streaming unit testing

11 minute read

When you develop a distributed system, it’s crucial to make it easy to test. Execute tests in a controlled environment, ideally from your IDE. A long develop-test-develop cycle for complex systems could kill your productivity. Below you will find my testing strategy for Spark and Spark Streaming applications.

Stream Processing

Test Driven Development for data engineers

31 minute read

Test Driven Development (TDD) is a well-established practice in the software engineering community. It helps to guarantee that code is reliable and error-free by requiring developers to write tests before writing the actual code, and promotes better code design and modularity for easier maintenance and extension. Para...

Unified batch and streaming

18 minute read

Unified batch and streaming processing is a data architecture that seamlessly combines both batch and real-time data processing. It enables organizations to gain real-time insights from their data while maintaining the ability to process large volumes of historical data. In the past organizations often dealt with ba...

Apache Beam SQL

36 minute read

If you are a BigData engineer who develops batch data pipelines, you might often hear that stream processing is the future. It unlocks the full potential of data that’s often unbounded in nature. You don’t need batch pipelines anymore, implement everything in a streaming fashion. There are plenty of modern and ea...

Stream processing – part 2

23 minute read

This is the second part of the stream processing blog post series. In the first part I presented aggregations in fixed, non-overlapping windows. Now you will learn dynamic aggregations in data-driven windows, for which the size of each window depends on the input data instead of a predefined time-based pattern. At...

Stream processing – part 1

17 minute read

This is the first part of the stream processing blog post series. From the series you will learn how to develop and test stateful streaming data pipelines.

Kafka Streams DSL vs processor API

29 minute read

Kafka Streams is a Java library for building real-time, highly scalable, fault-tolerant, distributed applications. The library is fully integrated with Kafka and leverages Kafka producer and consumer semantics (e.g. partitioning, rebalancing, data retention and compaction). What’s unique, the only dependency to r...

Long-running Spark Streaming jobs on YARN cluster

16 minute read

A long-running Spark Streaming job, once submitted to the YARN cluster, should run forever until it’s intentionally stopped. Any interruption introduces substantial processing delays and could lead to data loss or duplicates. Neither YARN nor Apache Spark has been designed for executing long-running services. But th...

Spark and Kafka integration patterns – part 2

21 minute read

In the world beyond batch, streaming data processing is the future of big data. Regardless of the streaming framework used for data processing, tight integration with a replayable data source like Apache Kafka is often required. The streaming applications often use Apache Kafka as a data source, or as a destination for ...

Spark and Kafka integration patterns – part 1

less than 1 minute read

I published a post on the allegro.tech blog about how to integrate Spark Streaming and Kafka. In the blog post you will find how to avoid the java.io.NotSerializableException when a Kafka producer is used for publishing results of the Spark Streaming processing.

GCP

Unified batch and streaming

18 minute read

Unified batch and streaming processing is a data architecture that seamlessly combines both batch and real-time data processing. It enables organizations to gain real-time insights from their data while maintaining the ability to process large volumes of historical data. In the past organizations often dealt with ba...

Apache Beam SQL

36 minute read

If you are a BigData engineer who develops batch data pipelines, you might often hear that stream processing is the future. It unlocks the full potential of data that’s often unbounded in nature. You don’t need batch pipelines anymore, implement everything in a streaming fashion. There are plenty of modern and ea...

Apache Beam Summit 2022

15 minute read

Last week I virtually attended Apache Beam Summit 2022 held in Austin, Texas. The event concentrates around Apache Beam and the runners like Dataflow, Flink, Spark or Samza. Below you can find my overall impression of the conference and notes from several interesting sessions. If some aspect was particularly appeal...

FinOps for data pipelines on Google Cloud Platform

25 minute read

Do you check the costs of the data pipelines in exactly the same way as you check overall health, latency or throughput? Nowadays, taking care of cost efficiency is an integral part of every data engineer’s job. I would like to share my own experiences with applying FinOps discipline in the organization within tens of da...

GCP Dataproc and Apache Spark tuning

8 minute read

Dataproc is a fully managed and highly scalable Google Cloud Platform service for running Apache Spark. However, “managed” doesn’t relieve you from the proper configuration to squeeze more processing power for less money. Today you will learn the easiest method to configure Dataproc and Spark together to get optimal...

GCP Cloud Composer and Apache Airflow tuning

15 minute read

I would love to only develop streaming pipelines but in reality some of them are still batch oriented. Today you will learn how to properly configure Google Cloud Platform scheduler – Cloud Composer.

Apache Spark

GCP Dataproc and Apache Spark tuning

8 minute read

Dataproc is a fully managed and highly scalable Google Cloud Platform service for running Apache Spark. However, “managed” doesn’t relieve you from the proper configuration to squeeze more processing power for less money. Today you will learn the easiest method to configure Dataproc and Spark together to get optimal...

Apache BigData Europe 2016

10 minute read

Last week I attended Apache Big Data Europe held in Sevilla, Spain. The event concentrates around big data projects under the Apache Foundation umbrella. Below you can find my overall impression of the conference and notes from several interesting sessions. The notes are presented as short checklists, if some aspect w...

Long-running Spark Streaming jobs on YARN cluster

16 minute read

A long-running Spark Streaming job, once submitted to the YARN cluster, should run forever until it’s intentionally stopped. Any interruption introduces substantial processing delays and could lead to data loss or duplicates. Neither YARN nor Apache Spark has been designed for executing long-running services. But th...

Spark application assembly for cluster deployments

12 minute read

When I tried to deploy my first Spark application on a YARN cluster, I realized that there were no clear and concise instructions on how to prepare the application for deployment. This blog post could be treated as a missing manual on how to build a Spark application written in Scala to get a deployable binary.

Spark and Kafka integration patterns – part 2

21 minute read

In the world beyond batch, streaming data processing is the future of big data. Regardless of the streaming framework used for data processing, tight integration with a replayable data source like Apache Kafka is often required. The streaming applications often use Apache Kafka as a data source, or as a destination for ...

Spark and Kafka integration patterns – part 1

less than 1 minute read

I published a post on the allegro.tech blog about how to integrate Spark Streaming and Kafka. In the blog post you will find how to avoid the java.io.NotSerializableException when a Kafka producer is used for publishing results of the Spark Streaming processing.

Spark and Spark Streaming unit testing

11 minute read

When you develop a distributed system, it’s crucial to make it easy to test. Execute tests in a controlled environment, ideally from your IDE. A long develop-test-develop cycle for complex systems could kill your productivity. Below you will find my testing strategy for Spark and Spark Streaming applications.

Apache Kafka

Flink Forward 2024

8 minute read

This week, I attended Flink Forward in Berlin, Germany. The event celebrated the 10th anniversary of Apache Flink. Below, you can find my overall impressions of the conference and notes from several interesting sessions. If an aspect was particularly appealing, I included a reference to supplementary materials.

Kafka Streams DSL vs processor API

29 minute read

Kafka Streams is a Java library for building real-time, highly scalable, fault-tolerant, distributed applications. The library is fully integrated with Kafka and leverages Kafka producer and consumer semantics (e.g. partitioning, rebalancing, data retention and compaction). What’s unique, the only dependency to r...

Apache BigData Europe 2016

10 minute read

Last week I attended Apache Big Data Europe held in Sevilla, Spain. The event concentrates around big data projects under the Apache Foundation umbrella. Below you can find my overall impression of the conference and notes from several interesting sessions. The notes are presented as short checklists, if some aspect w...

Spark application assembly for cluster deployments

12 minute read

When I tried to deploy my first Spark application on a YARN cluster, I realized that there were no clear and concise instructions on how to prepare the application for deployment. This blog post could be treated as a missing manual on how to build a Spark application written in Scala to get a deployable binary.

Spark and Kafka integration patterns – part 2

21 minute read

In the world beyond batch, streaming data processing is the future of big data. Regardless of the streaming framework used for data processing, tight integration with a replayable data source like Apache Kafka is often required. The streaming applications often use Apache Kafka as a data source, or as a destination for ...

Spark and Kafka integration patterns – part 1

less than 1 minute read

I published a post on the allegro.tech blog about how to integrate Spark Streaming and Kafka. In the blog post you will find how to avoid the java.io.NotSerializableException when a Kafka producer is used for publishing results of the Spark Streaming processing.

Apache Beam

Test Driven Development for data engineers

31 minute read

Test Driven Development (TDD) is a well-established practice in the software engineering community. It helps to guarantee that code is reliable and error-free by requiring developers to write tests before writing the actual code, and promotes better code design and modularity for easier maintenance and extension. Para...

Unified batch and streaming

18 minute read

Unified batch and streaming processing is a data architecture that seamlessly combines both batch and real-time data processing. It enables organizations to gain real-time insights from their data while maintaining the ability to process large volumes of historical data. In the past organizations often dealt with ba...

Apache Beam SQL

36 minute read

If you are a BigData engineer who develops batch data pipelines, you might often hear that stream processing is the future. It unlocks the full potential of data that’s often unbounded in nature. You don’t need batch pipelines anymore, implement everything in a streaming fashion. There are plenty of modern and ea...

Apache Beam Summit 2022

15 minute read

Last week I virtually attended Apache Beam Summit 2022 held in Austin, Texas. The event concentrates around Apache Beam and the runners like Dataflow, Flink, Spark or Samza. Below you can find my overall impression of the conference and notes from several interesting sessions. If some aspect was particularly appeal...

Stream processing – part 2

23 minute read

This is the second part of the stream processing blog post series. In the first part I presented aggregations in fixed, non-overlapping windows. Now you will learn dynamic aggregations in data-driven windows, for which the size of each window depends on the input data instead of a predefined time-based pattern. At...

Stream processing – part 1

17 minute read

This is the first part of the stream processing blog post series. From the series you will learn how to develop and test stateful streaming data pipelines.

Performance

GCP Dataproc and Apache Spark tuning

8 minute read

Dataproc is a fully managed and highly scalable Google Cloud Platform service for running Apache Spark. However, “managed” doesn’t relieve you from the proper configuration to squeeze more processing power for less money. Today you will learn the easiest method to configure Dataproc and Spark together to get optimal...

GCP Cloud Composer and Apache Airflow tuning

15 minute read

I would love to only develop streaming pipelines but in reality some of them are still batch oriented. Today you will learn how to properly configure Google Cloud Platform scheduler – Cloud Composer.

Kafka Streams DSL vs processor API

29 minute read

Kafka Streams is a Java library for building real-time, highly scalable, fault-tolerant, distributed applications. The library is fully integrated with Kafka and leverages Kafka producer and consumer semantics (e.g. partitioning, rebalancing, data retention and compaction). What’s unique, the only dependency to r...

Artifactory Performance Tuning

4 minute read

A few years ago I participated in Kirk Pepperdine’s Java performance tuning training. One of the greatest technical trainings I have ever attended! And also a great opportunity to visit Crete :-) Let’s check what I remember from the training … In this blog post I would like to show Artifactory memory utilization analy...

Homelab

Infrastructure as Code

3 minute read

From the very beginning, I used an Infrastructure as Code (IaC) approach in my homelab. However, due to privacy concerns, I couldn’t publish it as open source. Recently, I spent a lot of time separating sensitive information so that I could publish the rest as open source 😊

Homelab upgrade 2025

11 minute read

I built my homelab at the beginning of 2024, see Building your ultimate Homelab blog post. I hosted many services on it, for example: Home Assistant for home automation, Frigate for surveillance, Vaultwarden for password management, Paperless for document management, Omada Software Controller for network management,...

Home Assistant automations

6 minute read

In this blog post, I will share how I use Home Assistant to automate my home in a pragmatic way. Pragmatic means that I’m not trying to automate everything, but only things that make sense to me. For example, for lights automation in the house, I use PIR or microwave motion detectors connected directly to lights, in...

Building your ultimate Homelab – network

15 minute read

Welcome to the second part of my homelab blog series. In this installment, we’ll delve into the intricacies of network configuration for homelab setup. We’ll cover topics such as securely accessing your homelab from the internet using VPN mesh, organizing your local network with VLANs, and implementing ACLs for enha...

Building your ultimate Homelab – hardware

15 minute read

Welcome, fellow tech enthusiast! If you’re anything like me, you’ve probably dreamed of having your own little corner of the digital universe—a place where wires hum, servers whir, and blinking LEDs create a symphony of possibilities. Well, my friend, you’re about to embark on an exciting adventure as we dive headfi...

Spring

How to send email from JEE application

3 minute read

Sending email notifications from an enterprise application is a very common scenario. I know several methods to solve this puzzle; below you can find a short summary. To send an email from the application, at least the SMTP server address must be configured. Because the released application binary (e.g. WAR file) should be porta...

DDD Architecture Summary

5 minute read

In this blog post you can find my general rules for implementing a system using Domain Driven Design. Don’t use them blindly, but they are a good starting point for DDD practitioners.

Pure JEE or Spring Framework

5 minute read

During my career as a J2EE and JEE software developer I have tried to use pure JEE two or three times. And I decided not to repeat this exercise any more; it would be a waste of my precious time. Below you can find a short but quite comprehensive summary (based on Ilias Tsagklis):

Java

Development Environment Setup

3 minute read

This document is a manual on how to configure a flexible development environment for Java, JavaScript, Ruby and Python - my primary set of tools. Even if installing the runtimes with apt-get seems to be a trivial task, there is limited control over the installed version of the runtime. The goal is to configure environment ...

Pure JEE or Spring Framework

5 minute read

During my career as a J2EE and JEE software developer I have tried to use pure JEE two or three times. And I decided not to repeat this exercise any more; it would be a waste of my precious time. Below you can find a short but quite comprehensive summary (based on Ilias Tsagklis):

Artifactory Performance Tuning

4 minute read

A few years ago I participated in Kirk Pepperdine’s Java performance tuning training. One of the greatest technical trainings I have ever attended! And also a great opportunity to visit Crete :-) Let’s check what I remember from the training … In this blog post I would like to show Artifactory memory utilization analy...

Linux

Development Environment Setup

3 minute read

This document is a manual on how to configure a flexible development environment for Java, JavaScript, Ruby and Python - my primary set of tools. Even if installing the runtimes with apt-get seems to be a trivial task, there is limited control over the installed version of the runtime. The goal is to configure environment ...

Virtual Box VDI maintenance

1 minute read

Virtual Disk Image (VDI) is a Virtual Box container format for a guest hard disk. I found that VDI files on the host system grow over time. If your VDI file on the host system is much bigger than the used space on the guest partition, it is time for compaction:

Artifactory Performance Tuning

4 minute read

A few years ago I participated in Kirk Pepperdine’s Java performance tuning training. One of the greatest technical trainings I have ever attended! And also a great opportunity to visit Crete :-) Let’s check what I remember from the training … In this blog post I would like to show Artifactory memory utilization analy...

Dataproc

FinOps for data pipelines on Google Cloud Platform

25 minute read

Do you check the costs of the data pipelines in exactly the same way as you check overall health, latency or throughput? Nowadays, taking care of cost efficiency is an integral part of every data engineer’s job. I would like to share my own experiences with applying FinOps discipline in the organization within tens of da...

GCP Dataproc and Apache Spark tuning

8 minute read

Dataproc is a fully managed and highly scalable Google Cloud Platform service for running Apache Spark. However, “managed” doesn’t relieve you from the proper configuration to squeeze more processing power for less money. Today you will learn the easiest method to configure Dataproc and Spark together to get optimal...

Git

GitFlow step by step

4 minute read

Git Flow is a mainstream process for branch-per-feature development. Git Flow is the best method I’ve found for managing projects developed by small to medium project teams. Before you start reading this post you should read two mandatory readings:

How to convince your manager to adopt Git

4 minute read

Distributed Concurrent Versions Systems (DCVSs) like Git or Mercurial have changed software delivery processes significantly. I would not want to go back to the Middle Ages of Subversion, and I’m able to convince almost any developer to use DCVS. Convincing managers is a much tougher task. Below I collected some insigh...

Node.js

Development Environment Setup

3 minute read

This document is a manual on how to configure a flexible development environment for Java, JavaScript, Ruby and Python - my primary set of tools. Even if installing the runtimes with apt-get seems to be a trivial task, there is limited control over the installed version of the runtime. The goal is to configure environment ...

DDD

DDD Architecture Summary

5 minute read

In this blog post you can find my general rules for implementing a system using Domain Driven Design. Don’t use them blindly, but they are a good starting point for DDD practitioners.

Apache Hadoop

Apache BigData Europe 2016

10 minute read

Last week I attended Apache Big Data Europe held in Sevilla, Spain. The event concentrates around big data projects under the Apache Foundation umbrella. Below you can find my overall impression of the conference and notes from several interesting sessions. The notes are presented as short checklists, if some aspect w...

Long-running Spark Streaming jobs on YARN cluster

16 minute read

A long-running Spark Streaming job, once submitted to the YARN cluster, should run forever until it’s intentionally stopped. Any interruption introduces substantial processing delays and could lead to data loss or duplicates. Neither YARN nor Apache Spark was designed for executing long-running services. But th...

Dataflow

Apache Beam SQL

36 minute read

If you are a big data engineer who develops batch data pipelines, you might often hear that stream processing is the future. It unlocks the full potential of data that’s often unbounded in nature. You don’t need batch pipelines anymore; implement everything in a streaming fashion. There are plenty of modern and ea...

FinOps for data pipelines on Google Cloud Platform

25 minute read

Do you check the costs of your data pipelines in exactly the same way as you check overall health, latency or throughput? Nowadays, taking care of cost efficiency is an integral part of every data engineer’s job. I would like to share my own experience of applying the FinOps discipline in the organization within tens of da...

BigQuery

Terraform

Infrastructure as Code

3 minute read

From the very beginning, I used an Infrastructure as Code (IaC) approach in my homelab. However, due to privacy concerns, I couldn’t publish it as open source. Recently, I spent a lot of time separating out sensitive information so that I could publish the rest as open source 😊

Python

Development Environment Setup

3 minute read

This document is a guide to configuring a flexible development environment for Java, JavaScript, Ruby and Python, my primary set of tools. Even though installing the runtimes with apt-get seems trivial, it gives limited control over the installed version of each runtime. The goal is to configure environment ...

Ruby

Development Environment Setup

3 minute read

This document is a guide to configuring a flexible development environment for Java, JavaScript, Ruby and Python, my primary set of tools. Even though installing the runtimes with apt-get seems trivial, it gives limited control over the installed version of each runtime. The goal is to configure environment ...

JBehave

PHP

Programming language doesn’t matter

1 minute read

A few days ago I attended a quick presentation of a significant e-commerce platform. The custom platform is implemented mostly in PHP and designed as a scalable, distributed system. And I was really impressed! Below you can find a short summary of the chosen libraries, frameworks and tools.

Cloud Composer

GCP Cloud Composer and Apache Airflow tuning

15 minute read

I would love to develop only streaming pipelines, but in reality some of them are still batch-oriented. Today you will learn how to properly configure the Google Cloud Platform scheduler – Cloud Composer.

Apache Airflow

GCP Cloud Composer and Apache Airflow tuning

15 minute read

I would love to develop only streaming pipelines, but in reality some of them are still batch-oriented. Today you will learn how to properly configure the Google Cloud Platform scheduler – Cloud Composer.

SQL

Apache Beam SQL

36 minute read

If you are a big data engineer who develops batch data pipelines, you might often hear that stream processing is the future. It unlocks the full potential of data that’s often unbounded in nature. You don’t need batch pipelines anymore; implement everything in a streaming fashion. There are plenty of modern and ea...

Go

The Go programming language

3 minute read

Go is a modern programming language designed at Google to improve developer productivity. But how do you improve productivity? By limiting engineers’ power and flexibility 😂

Markdown

Technical writing techniques and tools

17 minute read

I’ve managed this blog since 2011. I’m also a primary contributor to the documentation managed by my team, for example: the onboarding guide for new colleagues or a tourist, technical guides, the definition of done, architecture decision records and more. As you perhaps already know, software engineers don’t like writing t...

Ansible

Infrastructure as Code

3 minute read

From the very beginning, I used an Infrastructure as Code (IaC) approach in my homelab. However, due to privacy concerns, I couldn’t publish it as open source. Recently, I spent a lot of time separating out sensitive information so that I could publish the rest as open source 😊
