This week, I attended Flink Forward in Berlin, Germany. The event celebrated the 10th anniversary of Apache Flink. Below, you can find my overall impressions of the conference and notes from several interesting sessions. If an aspect was particularly appealing, I included a reference to supplementary materials.

Intro

I don’t use Flink on a daily basis, but I hoped to gain some inspiration that I could apply to my real-time data pipelines running on GCP Dataflow.

Keynotes

  • Flink 2.0 was announced on the first day of the conference, perfect timing
  • Stephan Ewen and Feng Wang presented 15 years of project history: Apache Flink, which emerged around 2014, originally started as the Stratosphere research project at German universities in 2009


  • Kafka falls short in real-time streaming analytics


  • Truly unified batch and streaming with Apache Paimon: Streaming Lakehouse.
  • Fluss: Streaming storage for next-gen data analytics; it’s going to be open-sourced soon. Apologies for the low-quality picture.


Disaggregated state storage

  • Streaming Lakehouse: An enabler for unified batch and streaming in an innovative way

Streaming lakehouse

  • Breaking changes: While it’s unfortunate that the Scala API is deprecated, there’s a new extension available: Flink Scala API.
  • PMC members mentioned that they’re going to modernize the Java Streaming API soon. It’s the oldest and hardest-to-maintain part of the Flink API.

Breaking changes

  • Autoscaling example, simplified but self-explanatory.

Autoscaling example

  • Challenges: Unfortunately, the speaker struggled with time management and couldn’t delve into the details. The key lesson for me: don’t scale up if scaling has no effect. Dataflow engineering team, can you hear me?

Autoscaling challenges
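The “don’t scale up if scaling has no effect” rule can be sketched as a simple guard in an autoscaler loop. This is my own hypothetical illustration, not Flink’s actual autoscaler logic; the function, thresholds, and metric names are made up:

```python
# Hypothetical sketch of a scale-up guard: stop escalating parallelism when
# the previous scale-up brought no measurable throughput improvement,
# because the bottleneck is then elsewhere (e.g. an external system).
from dataclasses import dataclass


@dataclass
class ScaleDecision:
    parallelism: int
    reason: str


def next_parallelism(current: int, backlog_growing: bool,
                     throughput_before: float, throughput_after: float,
                     min_gain: float = 0.05, max_parallelism: int = 64) -> ScaleDecision:
    """Decide the next parallelism after a completed scaling action.

    throughput_before/after: records per second measured around the last
    scale-up. If the relative gain is below min_gain (5%), scaling up again
    is pointless and only burns money.
    """
    if not backlog_growing:
        return ScaleDecision(current, "backlog stable, keep parallelism")
    gain = (throughput_after - throughput_before) / max(throughput_before, 1e-9)
    if gain < min_gain:
        return ScaleDecision(current, "last scale-up had no effect, stop scaling")
    return ScaleDecision(min(current * 2, max_parallelism), "scale up")
```

A real implementation would also need a cool-down period so that two measurements are not taken while the job is still redistributing state.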

  • Memory management in Flink looks to me like a configuration and tuning nightmare. Flink autoscaling should help, see: FLIP-271.

Memory model
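To give a taste of the tuning surface, here are a few of the flink-conf.yaml memory options; the keys are real Flink configuration options, but the values are illustrative examples only, not recommendations:

```yaml
# Illustrative flink-conf.yaml fragment (example values, not recommendations)
taskmanager.memory.process.size: 4096m    # total memory of the TaskManager process
taskmanager.memory.managed.fraction: 0.4  # share for managed memory, e.g. RocksDB state backend
taskmanager.memory.network.min: 64mb
taskmanager.memory.network.max: 256mb
jobmanager.memory.process.size: 1600m
```

And this is only the top level: managed, network, framework, task, and JVM overhead memory each have further knobs.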

  • The best session of the first day, in my opinion!
  • I’m sure that Ben Augarten from Stripe knows how to manage Flink clusters and jobs at scale.

Stripe

  • With tight SLOs, there isn’t time for manual operations. If a job fails, roll back using the previously saved job graph. How do you decide, in a generic way, whether a job has failed? You should listen to the session.
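The rollback decision described above can be sketched roughly like this. To be clear, this is my own toy reconstruction, not Stripe’s actual system; the health signals and thresholds are invented for illustration:

```python
# Hypothetical sketch (not Stripe's implementation): decide generically
# whether a newly deployed Flink job is failing and should be rolled back
# to the previously saved job graph.
from dataclasses import dataclass


@dataclass
class JobHealth:
    restarts_last_10m: int          # task restarts in the last 10 minutes
    checkpoint_success_rate: float  # 0.0 .. 1.0 since deployment
    consumer_lag_growing: bool      # backlog keeps increasing after deployment


def should_roll_back(h: JobHealth) -> bool:
    """Generic failure signals: crash-looping, failing checkpoints, or a
    backlog that never drains all suggest the new job graph is unhealthy."""
    return (h.restarts_last_10m >= 3
            or h.checkpoint_success_rate < 0.5
            or h.consumer_lag_growing)


def deploy(new_graph: str, previous_graph: str, health: JobHealth) -> str:
    """Return the job graph that should end up running."""
    return previous_graph if should_roll_back(health) else new_graph
```

The hard part, as the speaker hinted, is picking signals that generalize across hundreds of jobs without per-job tuning.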


  • Shared Zookeepers and shared Flink clusters can lead to issues with noisy neighbors and the propagation of failures. Extra operational costs are worth it to support system stability and performance.

Visually diagnosing operator state problems

Datorios

  • Another session focused on current Kafka limitations
  • Mitigation strategies for noisy neighbors in Kafka: quotas and cluster mirroring
  • Introducing WarpStream: Confluent has acquired WarpStream
  • Stateless, leaderless brokers
  • Using object storage for state management: expect higher latency, but it should be acceptable for most use cases

WarpStream
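The quota mitigation mentioned above can be applied with the standard Kafka configuration tool; a sketch, where the client id and byte-rate limits are placeholders:

```shell
# Illustrative client quotas to contain a noisy neighbor
# ("analytics-app" and the limits are placeholder values)
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type clients --entity-name analytics-app \
  --add-config 'producer_byte_rate=1048576,consumer_byte_rate=2097152'
```

Quotas throttle rather than isolate, which is why the speakers also discussed cluster mirroring and, ultimately, architectures like WarpStream.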

  • A new business idea from one of the Flink founders: shift the focus from analytical to transactional processing.
  • Apply resilience and consistency lessons learned from building Flink to distributed transactional, RPC-based applications.

Restate Intro

Restate Durable Execution

  • The problem: local state doesn’t fit cloud-native architecture.

Large state

  • ForSt (for streaming) DB architecture

ForSt DB architecture

  • Performance dropped 100x when RocksDB was replaced with an object store as-is.
  • The new async API requires changes in all Flink operators!

State async API

  • Asynchronous state access improves performance but introduces a new challenge: ordering

Ordering
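The ordering challenge can be illustrated with a small sketch: with asynchronous state access, responses come back in arbitrary order, so requests for the same key must be serialized while different keys are free to interleave. This is my own simplified model of the idea, not Flink’s actual implementation:

```python
# Minimal sketch of per-key ordering for asynchronous state access:
# at most one in-flight request per key; later requests for the same key
# wait in a FIFO queue, while different keys run concurrently (which is
# where the async speedup comes from).
from collections import defaultdict, deque


class PerKeyOrderedExecutor:
    def __init__(self):
        self.in_flight = set()               # keys with an outstanding request
        self.pending = defaultdict(deque)    # per-key FIFO of waiting records
        self.completed = []                  # (key, record) in completion order

    def submit(self, key, record):
        if key in self.in_flight:
            self.pending[key].append(record)  # preserve per-key order
        else:
            self.in_flight.add(key)
            self._start(key, record)

    def _start(self, key, record):
        # In real code this would issue an async read/write against remote state.
        pass

    def on_response(self, key, record):
        self.completed.append((key, record))
        if self.pending[key]:
            self._start(key, self.pending[key].popleft())
        else:
            self.in_flight.discard(key)
```

Even this toy version shows why the change ripples through all operators: every operator must now tolerate records for other keys completing in between.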

  • Slower than local RocksDB, but performance looks promising

Benchmark

  • The most entertaining session of the conference, in my opinion
  • How to drive adoption of real-time analytics among non-technical users?

Real-time analysis adoption

  • Steffen Hoellinger invited us to conduct a POC together with Airy

Copilot architecture

  • Hmm, some technical knowledge is still required 😀

Sample session

  • Key lesson: context is much more important than model
  • Keep workspaces small to avoid hallucinations

Context vs Model

Materialized Table - Making Your Data Pipeline Easier

  • The most eye-opening session
  • Unification of batch, incremental, and real-time processing
  • Backfill

Materialized Table

  • Freshness vs Cost
  • Apply SET FRESHNESS = INTERVAL '1' HOUR and the framework will do the rest
  • Support for most SQL queries (without the ORDER BY clause)

Freshness vs Cost
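As I understood the session, the idea looks roughly like the sketch below. This is my reconstruction of the FLIP-435 materialized table syntax; the table, columns, and interval values are made up, so check the FLIP for the exact grammar:

```sql
-- Sketch of the FLIP-435 materialized table idea (syntax per my notes)
CREATE MATERIALIZED TABLE order_stats
FRESHNESS = INTERVAL '1' HOUR  -- the framework picks batch vs streaming refresh
AS SELECT product_id, COUNT(*) AS orders, SUM(amount) AS revenue
   FROM orders
   GROUP BY product_id;

-- Tightening freshness can switch the pipeline toward continuous processing
ALTER MATERIALIZED TABLE order_stats SET FRESHNESS = INTERVAL '1' MINUTE;
```

The appeal is that the freshness declaration, not the job code, determines whether you pay for a batch, incremental, or real-time pipeline.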

  • Cool demo: the materialized view freshness was changed, Flink jobs were rescheduled, and the BI dashboard updated in place. Yet another scenario for backfill.
  • Community version coming soon, see FLIP-435.

Event tracing

  • Session based on IoT vehicle data at Mercedes-Benz.
  • Apply OpenTelemetry to real-time data pipelines
  • Sample traced events to avoid a negative performance impact
  • Kafka sources: extract tracing context from headers
  • Flink steps: attach spans to all events
  • Kafka sinks: add tracing context to headers
  • In summary: a lot of extra work

Telemetry
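The source/step/sink propagation pattern above can be sketched without any tracing library by passing the W3C traceparent header through Kafka record headers. A real pipeline would use the OpenTelemetry SDK; this is only a hand-rolled illustration of the mechanics, and the helper names are my own:

```python
# Library-free sketch of trace propagation through Kafka headers:
# read `traceparent` on the source side, derive a child span id inside the
# processing step, and write the header back on the sink side.
import os
from typing import List, Optional, Tuple

Headers = List[Tuple[str, bytes]]


def extract_traceparent(headers: Headers) -> Optional[str]:
    """Kafka source side: pull the W3C traceparent header if present."""
    for key, value in headers:
        if key == "traceparent":
            return value.decode("ascii")
    return None


def new_child_traceparent(parent: str) -> str:
    """Processing step: keep the trace id, mint a new span id.
    traceparent format: version-traceid-spanid-flags."""
    version, trace_id, _parent_span, flags = parent.split("-")
    return f"{version}-{trace_id}-{os.urandom(8).hex()}-{flags}"


def inject_traceparent(headers: Headers, tp: str) -> Headers:
    """Kafka sink side: attach the (possibly new) traceparent header."""
    return [(k, v) for k, v in headers if k != "traceparent"] + [
        ("traceparent", tp.encode("ascii"))
    ]
```

Doing this consistently across every source, operator, and sink is exactly the “a lot of extra work” the speaker warned about.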

Summary

Attending the conference was a valuable experience, offering deep insights into the latest developments in Apache Flink. Here are my key takeaways:

  • Listening to sessions about the challenges of Flink deployment and operations from the trenches made me appreciate the simplicity of Dataflow even more.
  • I now believe in the potential of truly unified batch and streaming processing. FLIP-435 and the streaming lakehouse give hope that SET FRESHNESS could switch processing modes from batch, through incremental, to real-time.
  • For high adoption of real-time analytics, consider using GenAI to hide the complexity of the underlying data pipelines.
  • My general impression is that Kafka’s limitations in the cloud-native era have been confirmed.
