Posts by Tag

Software Engineering

Technical writing techniques and tools

17 minute read

I’ve managed this blog since 2011. I’m also a primary contributor to the documentation managed by my team. For example: on-boarding guide for new colleagues or a tourist, technical guides, definition of done, architecture decision records and more. As you perhaps already know, software engineers don’t like writing t...

FinOps for data pipelines on Google Cloud Platform

25 minute read

Do you check the costs of your data pipelines in exactly the same way as you check overall health, latency or throughput? Nowadays, taking care of cost efficiency is an integral part of every data engineer’s job. I would like to share my own experiences with applying the FinOps discipline in the organization within tens of da...

Spark and Spark Streaming unit testing

11 minute read

When you develop a distributed system, it’s crucial to make it easy to test. Execute tests in a controlled environment, ideally from your IDE. A long develop-test-develop cycle for complex systems can kill your productivity. Below you will find my testing strategy for Spark and Spark Streaming applications.

Release It! – book review

7 minute read

Recently I read the excellent book Release It! by Michael Nygard. The book is 7 years old and I don’t know how I could have missed it until now.

How to document your professional experiences

1 minute read

Have you considered what’s important for a prospective employer? What’s the most valuable source of information about your professional experience? How do you document that you are an expert in software engineering? Below you can find some of my tricks: Write a blog, teaching is the best learning method :-) Write an ...

Code coverage for managers and developers

2 minute read

From time to time, people ask me what code coverage by tests should be. Does 60% mean that the project is healthy? Or maybe the goal should be 70% or 80%?

How to convince your manager to adopt Git

4 minute read

Distributed Concurrent Versions Systems (DCVSs) like Git or Mercurial have changed software delivery processes significantly. I would not want to go back to the Middle Ages of Subversion, and I’m able to convince almost any developer to use a DCVS. Convincing managers is a much tougher task. Below I collected some insigh...

Architecture

Programming language doesn’t matter

1 minute read

A few days ago I attended a quick presentation of a significant e-commerce platform. The custom platform is implemented mostly in PHP and designed as a scalable, distributed system. And I was really impressed! Below you can find a short summary of the chosen libraries, frameworks and tools.

Release It! – book review

7 minute read

Recently I read the excellent book Release It! by Michael Nygard. The book is 7 years old and I don’t know how I could have missed it until now.

The Twelve-Factor App – part 1

4 minute read

During my studies of “Micro Services” I found a comprehensive (but short) document about the Twelve-Factor App methodology for building software-as-a-service applications. The original paper is published at 12factor.net. Below you can find a short summary of my experiences with the first part of the document. There is a...

How to send email from JEE application

3 minute read

Sending email notifications from an enterprise application is a very common scenario. I know several methods to solve this puzzle; below you can find a short summary. To send an email from the application, at least the SMTP server address must be configured. Because the released application binary (e.g. a WAR file) should be porta...

DDD Architecture Summary

5 minute read

In this blog post you can find my general rules for implementing a system using Domain-Driven Design. Don’t use them blindly, but they are a good starting point for DDD practitioners.

Pure JEE or Spring Framework

5 minute read

During my career as a J2EE and JEE software developer I have tried to use pure JEE two or three times. I decided not to repeat this exercise any more; it would be a waste of my precious time. Below you can find a short but quite comprehensive summary (based on Ilias Tsagklis):

Apache Spark

GCP Dataproc and Apache Spark tuning

8 minute read

Dataproc is a fully managed and highly scalable Google Cloud Platform service for running Apache Spark. However, “managed” doesn’t relieve you of the need for proper configuration to squeeze out more processing power for less money. Today you will learn the easiest method to configure Dataproc and Spark together to get optimal...

Apache BigData Europe 2016

10 minute read

Last week I attended Apache Big Data Europe held in Sevilla, Spain. The event concentrates on big data projects under the Apache Foundation umbrella. Below you can find my overall impression of the conference and notes from several interesting sessions. The notes are presented as short checklists, if some aspect w...

Long-running Spark Streaming jobs on YARN cluster

16 minute read

A long-running Spark Streaming job, once submitted to the YARN cluster, should run forever until it’s intentionally stopped. Any interruption introduces substantial processing delays and could lead to data loss or duplicates. Neither YARN nor Apache Spark was designed for executing long-running services. But th...

Spark application assembly for cluster deployments

12 minute read

When I tried to deploy my first Spark application on a YARN cluster, I realized that there were no clear and concise instructions on how to prepare the application for deployment. This blog post can be treated as the missing manual on how to build a Spark application written in Scala into a deployable binary.

Spark and Kafka integration patterns – part 2

21 minute read

In the world beyond batch, streaming data processing is the future of big data. Regardless of the streaming framework used for data processing, tight integration with a replayable data source like Apache Kafka is often required. Streaming applications often use Apache Kafka as a data source, or as a destination for ...

Spark and Kafka integration patterns – part 1

less than 1 minute read

I published a post on the allegro.tech blog about how to integrate Spark Streaming and Kafka. In the post you will find how to avoid java.io.NotSerializableException when a Kafka producer is used for publishing the results of Spark Streaming processing.

Spark and Spark Streaming unit testing

11 minute read

When you develop a distributed system, it’s crucial to make it easy to test. Execute tests in a controlled environment, ideally from your IDE. A long develop-test-develop cycle for complex systems can kill your productivity. Below you will find my testing strategy for Spark and Spark Streaming applications.

Scala

Stream processing – part 2

23 minute read

This is the second part of the stream processing blog post series. In the first part I presented aggregations in fixed, non-overlapping windows. Now you will learn dynamic aggregations in data-driven windows, for which the size of each window depends on the input data instead of a predefined time-based pattern. A...

Stream processing – part 1

17 minute read

This is the very first part of the stream processing blog post series. From the series you will learn how to develop and test stateful streaming data pipelines.

Kafka Streams DSL vs processor API

29 minute read

Kafka Streams is a Java library for building real-time, highly scalable, fault-tolerant, distributed applications. The library is fully integrated with Kafka and leverages Kafka producer and consumer semantics (e.g. partitioning, rebalancing, data retention and compaction). What’s unique, the only dependency to r...

Spark application assembly for cluster deployments

12 minute read

When I tried to deploy my first Spark application on a YARN cluster, I realized that there were no clear and concise instructions on how to prepare the application for deployment. This blog post can be treated as the missing manual on how to build a Spark application written in Scala into a deployable binary.

Spark and Kafka integration patterns – part 2

21 minute read

In the world beyond batch, streaming data processing is the future of big data. Regardless of the streaming framework used for data processing, tight integration with a replayable data source like Apache Kafka is often required. Streaming applications often use Apache Kafka as a data source, or as a destination for ...

Spark and Kafka integration patterns – part 1

less than 1 minute read

I published a post on the allegro.tech blog about how to integrate Spark Streaming and Kafka. In the post you will find how to avoid java.io.NotSerializableException when a Kafka producer is used for publishing the results of Spark Streaming processing.

Spark and Spark Streaming unit testing

11 minute read

When you develop a distributed system, it’s crucial to make it easy to test. Execute tests in a controlled environment, ideally from your IDE. A long develop-test-develop cycle for complex systems can kill your productivity. Below you will find my testing strategy for Spark and Spark Streaming applications.

Stream Processing

Apache Beam SQL

36 minute read

If you are a BigData engineer who develops batch data pipelines, you might often hear that stream processing is the future. It unlocks the full potential of data that’s often unbounded in nature. You don’t need batch pipelines anymore, implement everything in a streaming fashion. There are plenty of modern and ea...

Stream processing – part 2

23 minute read

This is the second part of the stream processing blog post series. In the first part I presented aggregations in fixed, non-overlapping windows. Now you will learn dynamic aggregations in data-driven windows, for which the size of each window depends on the input data instead of a predefined time-based pattern. A...

Stream processing – part 1

17 minute read

This is the very first part of the stream processing blog post series. From the series you will learn how to develop and test stateful streaming data pipelines.

Kafka Streams DSL vs processor API

29 minute read

Kafka Streams is a Java library for building real-time, highly scalable, fault-tolerant, distributed applications. The library is fully integrated with Kafka and leverages Kafka producer and consumer semantics (e.g. partitioning, rebalancing, data retention and compaction). What’s unique, the only dependency to r...

Long-running Spark Streaming jobs on YARN cluster

16 minute read

A long-running Spark Streaming job, once submitted to the YARN cluster, should run forever until it’s intentionally stopped. Any interruption introduces substantial processing delays and could lead to data loss or duplicates. Neither YARN nor Apache Spark was designed for executing long-running services. But th...

Spark and Kafka integration patterns – part 2

21 minute read

In the world beyond batch, streaming data processing is the future of big data. Regardless of the streaming framework used for data processing, tight integration with a replayable data source like Apache Kafka is often required. Streaming applications often use Apache Kafka as a data source, or as a destination for ...

Spark and Kafka integration patterns – part 1

less than 1 minute read

I published a post on the allegro.tech blog about how to integrate Spark Streaming and Kafka. In the post you will find how to avoid java.io.NotSerializableException when a Kafka producer is used for publishing the results of Spark Streaming processing.

GCP

Apache Beam SQL

36 minute read

If you are a BigData engineer who develops batch data pipelines, you might often hear that stream processing is the future. It unlocks the full potential of data that’s often unbounded in nature. You don’t need batch pipelines anymore, implement everything in a streaming fashion. There are plenty of modern and ea...

Apache Beam Summit 2022

15 minute read

Last week I virtually attended Apache Beam Summit 2022 held in Austin, Texas. The event concentrates on Apache Beam and runners like Dataflow, Flink, Spark or Samza. Below you can find my overall impression of the conference and notes from several interesting sessions. If some aspect was particularly appeal...

FinOps for data pipelines on Google Cloud Platform

25 minute read

Do you check the costs of your data pipelines in exactly the same way as you check overall health, latency or throughput? Nowadays, taking care of cost efficiency is an integral part of every data engineer’s job. I would like to share my own experiences with applying the FinOps discipline in the organization within tens of da...

GCP Dataproc and Apache Spark tuning

8 minute read

Dataproc is a fully managed and highly scalable Google Cloud Platform service for running Apache Spark. However, “managed” doesn’t relieve you of the need for proper configuration to squeeze out more processing power for less money. Today you will learn the easiest method to configure Dataproc and Spark together to get optimal...

GCP Cloud Composer and Apache Airflow tuning

15 minute read

I would love to develop only streaming pipelines, but in reality some of them are still batch-oriented. Today you will learn how to properly configure the Google Cloud Platform scheduler – Cloud Composer.

Performance

GCP Dataproc and Apache Spark tuning

8 minute read

Dataproc is a fully managed and highly scalable Google Cloud Platform service for running Apache Spark. However, “managed” doesn’t relieve you of the need for proper configuration to squeeze out more processing power for less money. Today you will learn the easiest method to configure Dataproc and Spark together to get optimal...

GCP Cloud Composer and Apache Airflow tuning

15 minute read

I would love to develop only streaming pipelines, but in reality some of them are still batch-oriented. Today you will learn how to properly configure the Google Cloud Platform scheduler – Cloud Composer.

Kafka Streams DSL vs processor API

29 minute read

Kafka Streams is a Java library for building real-time, highly scalable, fault-tolerant, distributed applications. The library is fully integrated with Kafka and leverages Kafka producer and consumer semantics (e.g. partitioning, rebalancing, data retention and compaction). What’s unique, the only dependency to r...

Artifactory Performance Tuning

4 minute read

A few years ago I participated in Kirk Pepperdine’s Java performance tuning training. One of the greatest technical trainings I have ever attended! And also a great opportunity to visit Crete :-) Let’s check what I remember from the training … In this blog post I would like to show Artifactory memory utilization analy...

Apache Kafka

Kafka Streams DSL vs processor API

29 minute read

Kafka Streams is a Java library for building real-time, highly scalable, fault-tolerant, distributed applications. The library is fully integrated with Kafka and leverages Kafka producer and consumer semantics (e.g. partitioning, rebalancing, data retention and compaction). What’s unique, the only dependency to r...

Apache BigData Europe 2016

10 minute read

Last week I attended Apache Big Data Europe held in Sevilla, Spain. The event concentrates on big data projects under the Apache Foundation umbrella. Below you can find my overall impression of the conference and notes from several interesting sessions. The notes are presented as short checklists, if some aspect w...

Spark application assembly for cluster deployments

12 minute read

When I tried to deploy my first Spark application on a YARN cluster, I realized that there were no clear and concise instructions on how to prepare the application for deployment. This blog post can be treated as the missing manual on how to build a Spark application written in Scala into a deployable binary.

Spark and Kafka integration patterns – part 2

21 minute read

In the world beyond batch, streaming data processing is the future of big data. Regardless of the streaming framework used for data processing, tight integration with a replayable data source like Apache Kafka is often required. Streaming applications often use Apache Kafka as a data source, or as a destination for ...

Spark and Kafka integration patterns – part 1

less than 1 minute read

I published a post on the allegro.tech blog about how to integrate Spark Streaming and Kafka. In the post you will find how to avoid java.io.NotSerializableException when a Kafka producer is used for publishing the results of Spark Streaming processing.

Spring

How to send email from JEE application

3 minute read

Sending email notifications from an enterprise application is a very common scenario. I know several methods to solve this puzzle; below you can find a short summary. To send an email from the application, at least the SMTP server address must be configured. Because the released application binary (e.g. a WAR file) should be porta...

DDD Architecture Summary

5 minute read

In this blog post you can find my general rules for implementing a system using Domain-Driven Design. Don’t use them blindly, but they are a good starting point for DDD practitioners.

Pure JEE or Spring Framework

5 minute read

During my career as a J2EE and JEE software developer I have tried to use pure JEE two or three times. I decided not to repeat this exercise any more; it would be a waste of my precious time. Below you can find a short but quite comprehensive summary (based on Ilias Tsagklis):

Apache Beam

Apache Beam SQL

36 minute read

If you are a BigData engineer who develops batch data pipelines, you might often hear that stream processing is the future. It unlocks the full potential of data that’s often unbounded in nature. You don’t need batch pipelines anymore, implement everything in a streaming fashion. There are plenty of modern and ea...

Apache Beam Summit 2022

15 minute read

Last week I virtually attended Apache Beam Summit 2022 held in Austin, Texas. The event concentrates on Apache Beam and runners like Dataflow, Flink, Spark or Samza. Below you can find my overall impression of the conference and notes from several interesting sessions. If some aspect was particularly appeal...

Stream processing – part 2

23 minute read

This is the second part of the stream processing blog post series. In the first part I presented aggregations in fixed, non-overlapping windows. Now you will learn dynamic aggregations in data-driven windows, for which the size of each window depends on the input data instead of a predefined time-based pattern. A...

Stream processing – part 1

17 minute read

This is the very first part of the stream processing blog post series. From the series you will learn how to develop and test stateful streaming data pipelines.

Java

Development Environment Setup

3 minute read

This document is a manual on how to configure a flexible development environment for Java, JavaScript, Ruby and Python - my primary set of tools. Even if installing the runtimes with apt-get seems to be a trivial task, there is limited control over the installed version of the runtime. The goal is to configure environment ...

Pure JEE or Spring Framework

5 minute read

During my career as a J2EE and JEE software developer I have tried to use pure JEE two or three times. I decided not to repeat this exercise any more; it would be a waste of my precious time. Below you can find a short but quite comprehensive summary (based on Ilias Tsagklis):

Artifactory Performance Tuning

4 minute read

A few years ago I participated in Kirk Pepperdine’s Java performance tuning training. One of the greatest technical trainings I have ever attended! And also a great opportunity to visit Crete :-) Let’s check what I remember from the training … In this blog post I would like to show Artifactory memory utilization analy...

Linux

Development Environment Setup

3 minute read

This document is a manual on how to configure a flexible development environment for Java, JavaScript, Ruby and Python - my primary set of tools. Even if installing the runtimes with apt-get seems to be a trivial task, there is limited control over the installed version of the runtime. The goal is to configure environment ...

Virtual Box VDI maintenance

1 minute read

Virtual Disk Image (VDI) is a Virtual Box container format for a guest hard disk. I found that VDI files on the host system grow over time. If your VDI file on the host system is much bigger than the used space on the guest partition, it is time for compaction:

Artifactory Performance Tuning

4 minute read

A few years ago I participated in Kirk Pepperdine’s Java performance tuning training. One of the greatest technical trainings I have ever attended! And also a great opportunity to visit Crete :-) Let’s check what I remember from the training … In this blog post I would like to show Artifactory memory utilization analy...

Dataproc

FinOps for data pipelines on Google Cloud Platform

25 minute read

Do you check the costs of your data pipelines in exactly the same way as you check overall health, latency or throughput? Nowadays, taking care of cost efficiency is an integral part of every data engineer’s job. I would like to share my own experiences with applying the FinOps discipline in the organization within tens of da...

GCP Dataproc and Apache Spark tuning

8 minute read

Dataproc is a fully managed and highly scalable Google Cloud Platform service for running Apache Spark. However, “managed” doesn’t relieve you of the need for proper configuration to squeeze out more processing power for less money. Today you will learn the easiest method to configure Dataproc and Spark together to get optimal...

Git

GitFlow step by step

4 minute read

Git Flow is a mainstream process for branch-per-feature development. Git Flow is the best method I’ve found for managing projects developed by small to medium teams. Before you start reading this post you should read two mandatory texts:

How to convince your manager to adopt Git

4 minute read

Distributed Concurrent Versions Systems (DCVSs) like Git or Mercurial have changed software delivery processes significantly. I would not want to go back to the Middle Ages of Subversion, and I’m able to convince almost any developer to use a DCVS. Convincing managers is a much tougher task. Below I collected some insigh...

Node.js

Development Environment Setup

3 minute read

This document is a manual on how to configure a flexible development environment for Java, JavaScript, Ruby and Python - my primary set of tools. Even if installing the runtimes with apt-get seems to be a trivial task, there is limited control over the installed version of the runtime. The goal is to configure environment ...

DDD

DDD Architecture Summary

5 minute read

In this blog post you can find my general rules for implementing a system using Domain-Driven Design. Don’t use them blindly, but they are a good starting point for DDD practitioners.

Apache Hadoop

Apache BigData Europe 2016

10 minute read

Last week I attended Apache Big Data Europe held in Sevilla, Spain. The event concentrates on big data projects under the Apache Foundation umbrella. Below you can find my overall impression of the conference and notes from several interesting sessions. The notes are presented as short checklists, if some aspect w...

Long-running Spark Streaming jobs on YARN cluster

16 minute read

A long-running Spark Streaming job, once submitted to the YARN cluster, should run forever until it’s intentionally stopped. Any interruption introduces substantial processing delays and could lead to data loss or duplicates. Neither YARN nor Apache Spark was designed for executing long-running services. But th...

Dataflow

Apache Beam SQL

36 minute read

If you are a BigData engineer who develops batch data pipelines, you might often hear that stream processing is the future. It unlocks the full potential of data that’s often unbounded in nature. You don’t need batch pipelines anymore, implement everything in a streaming fashion. There are plenty of modern and ea...

FinOps for data pipelines on Google Cloud Platform

25 minute read

Do you check the costs of your data pipelines in exactly the same way as you check overall health, latency or throughput? Nowadays, taking care of cost efficiency is an integral part of every data engineer’s job. I would like to share my own experiences with applying the FinOps discipline in the organization within tens of da...

Python

Development Environment Setup

3 minute read

This document is a manual on how to configure a flexible development environment for Java, JavaScript, Ruby and Python - my primary set of tools. Even if installing the runtimes with apt-get seems to be a trivial task, there is limited control over the installed version of the runtime. The goal is to configure environment ...

Ruby

Development Environment Setup

3 minute read

This document is a manual on how to configure a flexible development environment for Java, JavaScript, Ruby and Python - my primary set of tools. Even if installing the runtimes with apt-get seems to be a trivial task, there is limited control over the installed version of the runtime. The goal is to configure environment ...

JBehave

PHP

Programming language doesn’t matter

1 minute read

A few days ago I attended a quick presentation of a significant e-commerce platform. The custom platform is implemented mostly in PHP and designed as a scalable, distributed system. And I was really impressed! Below you can find a short summary of the chosen libraries, frameworks and tools.

Cloud Composer

GCP Cloud Composer and Apache Airflow tuning

15 minute read

I would love to develop only streaming pipelines, but in reality some of them are still batch-oriented. Today you will learn how to properly configure the Google Cloud Platform scheduler – Cloud Composer.

Apache Airflow

GCP Cloud Composer and Apache Airflow tuning

15 minute read

I would love to develop only streaming pipelines, but in reality some of them are still batch-oriented. Today you will learn how to properly configure the Google Cloud Platform scheduler – Cloud Composer.

BigQuery

Terraform

SQL

Apache Beam SQL

36 minute read

If you are a BigData engineer who develops batch data pipelines, you might often hear that stream processing is the future. It unlocks the full potential of data that’s often unbounded in nature. You don’t need batch pipelines anymore, implement everything in a streaming fashion. There are plenty of modern and ea...

Go

The Go programming language

3 minute read

Go is a modern programming language designed at Google to improve developer productivity. But how to improve productivity? By limiting engineers’ power and flexibility 😂

Markdown

Technical writing techniques and tools

17 minute read

I’ve managed this blog since 2011. I’m also a primary contributor to the documentation managed by my team. For example: on-boarding guide for new colleagues or a tourist, technical guides, definition of done, architecture decision records and more. As you perhaps already know, software engineers don’t like writing t...
