Last week I virtually attended Apache Beam Summit 2022, held in Austin, Texas. The event centers on Apache Beam and its runners such as Dataflow, Flink, Spark and Samza. Below you can find my overall impressions of the conference and notes from several interesting sessions. Where an aspect was particularly appealing, I added a reference to supplementary materials.

Overview

The conference was a hybrid event; all sessions (except for the workshops) were live-streamed to the online audience for free.

During the sessions you could ask questions on the streaming platform, and questions from the online audience were answered by the speakers. After a session finished, the recording was available on the streaming platform. I was able to replay afternoon sessions the next morning, which is very convenient for people in non-US timezones like me.

For the first two days the program was organized into three tracks; the third day was dedicated to the onsite workshops, but there was also one track for the online audience.

Google’s investment in Beam, and internal use of Beam at Google

The keynote session was presented by Kerry Donny-Clark (manager of the Apache Beam team at Google).

  • Google-centric keynote; I missed some comparison between Apache Beam / Dataflow and other data processing industry leaders
  • A few words about new runners: Hazelcast, Ray, Dask
  • TypeScript SDK as an outcome of an internal Google hackathon
  • Google’s engagement in the Apache Beam community

Beam team at Google

How the sausage gets made: Dataflow under the covers

The session was presented by Pablo Estrada (software engineer at Google, PMC member).

  • Excellent session, highly recommended if you deploy non-trivial Apache Beam pipelines on Dataflow runner
  • What does “exactly once” really mean in Apache Beam / Dataflow runner

Exactly once

  • Runner optimizations: fusion, flatten sinking, combiner lifting
  • Interesting papers, they help to understand Dataflow runner principles: Photon, MillWheel
  • Batch vs streaming: same controller, but data path is different. Different engineering teams, no plans to unify.
  • Batch bundle: 100–1000 elements vs. streaming bundle: < 10 elements (when pipeline is up-to-date)
  • Controlling batches: GroupIntoBatches
  • In streaming EACH element has a key (implicit or explicit)

Elements keys in streaming

  • Processing is serial for each key!


Introduction to performance testing in Apache Beam

The session was presented by Alexey Romanenko (principal software engineer at Talend, PMC member).

There are four performance test categories in Apache Beam, and the test results are available at https://metrics.beam.apache.org

  1. IO integration tests
    • only for batch
    • only for Java SDK
    • not for all IOs
    • only read/write time metrics
  2. Core Beam tests
    • synthetic sources
    • ParDo, ParDo with SideInput, GBK, CoGBK
    • batch and streaming
    • all SDKs
    • Dataflow, Flink, Spark
    • many metrics
  3. Nexmark
    • batch and streaming
    • Java SDK
    • Dataflow, Flink, Spark
    • small scale, not industry standard (hard to compare)
  4. TPC-DS
    • industry standard
    • SQL based
    • batch only
    • CSV and Parquet as input
    • Dataflow, Spark, Flink
    • 25 of 103 queries currently pass

How to benchmark your Beam pipelines for cost optimization and capacity planning

The session was presented by Roy Arsan (solution architect at Google).

  • How to verify performance and costs for different Dataflow setups using predefined scenarios?
  • Use PerfKit benchmark!
  • Runtime, CPU utilization and costs for different workers

PerfKit results for WordCount

  • Results for the streaming job scenario

WordCount PerfKit results for streaming job

Palo Alto Networks’ massive-scale deployment of Beam

The session was presented by Talat Uyarer (Senior Principal Software Engineer at Palo Alto Networks).

  • 10k streaming jobs!
  • Custom solution for creating / deploying / managing Dataflow jobs
  • Custom solution for managing Kafka clusters
  • Example job definition

Job definition

  • Custom sinks (unfortunately no details)
  • Rolling updates or fallback to drain/start (but with the job cold start to minimize downtime)
  • Many challenges with Avro schema evolution, see case study

New Avro Serialization And Deserialization In Beam SQL

The session was presented by Talat Uyarer (Senior Principal Software Engineer at Palo Alto Networks).

  • How can we improve latency while using BeamSQL and Avro payloads?

Beam SQL


RunInference: Machine Learning Inferences in Beam

The session was presented by Andy Ye (software engineer at Google).

  • Reusable transform to run ML inferences, see documentation
  • Support for PyTorch, SciKit and TensorFlow
  • More in the future

RunInference future

Unified Streaming And Batch Pipelines At LinkedIn Using Beam

The session was presented by Shangjin Zhang (staff software engineer at LinkedIn) and Yuhong Cheng (software engineer at Linkedin).

LinkedIn kappa architecture

  • Streaming back-filling issues: hard to scale, floods the lookup tables, noisy neighbor to regular streaming pipelines

LinkedIn unified architecture

  • Single codebase: Samza runner for streaming, Spark runner for batch
  • Unified PTransform with expandStreaming and expandBatch methods
  • Unified table join: key lookup for streaming and coGroupByKey for batch
  • No windows in the pipeline (lucky them)

LinkedIn back-filling results

  • Job duration decreased from 450 minutes (streaming) to 25 minutes (batch)

Log ingestion and data replication at Twitter

The session was presented by Praveen Killamsetti and Zhenzhao Wang (staff engineers at Twitter).

Batch ingestion architecture - Data Lifecycle Manager:

Twitter batch ingestion architecture

  • Many data sources: HDFS, GCS, S3, BigQuery, Manhattan
  • Versioned datasets with metadata layer
  • Replicates data to all data sources (seems to be very expensive)
  • GUI for configuration, DLM manages all the jobs
  • Plans: migrate 600+ custom data pipelines to DLM platform

Streaming log ingestion - Sparrow

Twitter log ingestion

  • E2E latency - up to 13 minutes instead of hours
  • One Beam Job and Pubsub Subscription per dataset per transformation (again seems to be very expensive)
  • Decreased job resource usage by 80–86% via removing shuffle in BigQuery IO connector
  • Reduced worker usage by 20% via data (batches) compression on Pubsub
  • Optimized schema conversion logic (Thrift -> Avro -> TableRow)

Detecting Change-Points in Real-Time with Apache Beam

The session was presented by Devon Peticolas (principal engineer at Oden Technologies).

  • Nice session with realistic, non-trivial streaming IOT scenarios
  • How Oden uses Beam: detecting categorical changes based on continuous metrics

Oden use-case

  • Attempt 1: stateful DoFn - problem: out-of-order events (naive comparison of the current and last elements)

Oden stateful DoFn

  • Attempt 2: watermark triggered window - problem: lag for non-homogeneous data sources

Oden watermark trigger

  • Attempt 3: data triggered window - problem: sparse data when not all events are delivered

Oden data trigger

  • Smoothing - see the original session, it’s hard to summarize the concept in a single sentence

Oden smoothing

Worth seeing sessions

Tailoring pipelines at Spotify

The session was presented by Rickard Zwahlen (data engineer at Spotify).

  • Thousands of relatively similar data pipelines to manage
  • Reusable Apache Beam jobs packed as Docker images and managed by Backstage
  • Typical use cases: data profiling, anomaly detection
  • More complex use cases - custom pipeline development

Houston, we’ve got a problem: 6 principles for pipelines design taken from the Apollo missions

The session was presented by Israel Herraiz and Paul Balm (strategic cloud engineers at Google).

  • Nothing spectacular but worth seeing, below you can find the agenda

6 principles

Strategies for caching data in Dataflow using Beam SDK

The session was presented by Zeeshan (cloud engineer).

  • Side input (for streaming engine stored in BigTable)
  • The apache_beam.utils.shared.Shared utility (Python only)
  • Stateful DoFn (per key and window); define element TTLs for the global window
  • External cache

Migrating Spark to Apache Beam/Dataflow and hexagonal architecture + DDD

The session was presented by Mazlum Tosun.

  • DDD/Hexagonal architecture pipeline code organization
  • Static dependency injection with Dagger
  • Test data defined as JSON files

I would say: overkill … I’m going to write a blog post on how to achieve testable Apache Beam pipelines aligned with DDD architecture in a simpler way :)

Error handling with Apache Beam and Asgarde library

The session was presented by Mazlum Tosun.

Asgarde error handling for Java

Asgarde error handling for Python

Beam as a High-Performance Compute Grid

The session was presented by Peter Coyle (Head of Risk Technology Engineering Excellence at HSBC) and Raj Subramani.

  • Risk management system for investment banking at HSBC
  • Flink runner & Dataflow runner
  • A trillion evaluations (whatever that means)
  • Choose the most efficient worker type for cost efficiency
  • Manage shuffle slots quotas for batch Dataflow
  • Instead of reservations (not available for Dataflow), define a disaster recovery scenario and fall back to a more common worker type

Dataflow disaster recovery planning

Optimizing a Dataflow pipeline for cost efficiency: Lessons learned at Orange

The session was presented by Jérémie Gomez (cloud consultant at Google) and Thomas Sauvagnat (data engineer at Orange).

  • How to store Orange LiveBox data into BigQuery (33TB of billed bytes daily)?
  • Initial architecture: data on GCS, on-finalize trigger, Dataflow streaming job notified from Pubsub, read files from GCS and store to BQ
  • It isn’t cheap stuff ;)
  • The optimization plan

Optimization plan

  • Storage Write API instead of Streaming Inserts
  • BigQuery batch loads didn’t work as well
  • N2 workers instead of N1
  • Smaller workers (n2-standard-8 instead of n2-standard-16)
  • Disabled autoscaling (the pipeline latency isn’t critical and the Dataflow autoscaler policy cannot be configured)
  • USE BATCH INSTEAD OF STREAMING if feasible

Relational Beam: Process columns, not rows!

The session was presented by Andrew Pilloud and Brian Hulette (software engineers at Google, Apache Beam committers)

  • Beam isn’t relational: it is row-oriented and data is represented as bytes
  • What’s needed: data schema, metadata of computation

Relational

  • Batched DoFn (doesn’t exist yet) https://s.apache.org/batched-dofns
  • Projection pushdown (currently for BigQueryIO.TypedRead only; 2.38 batch, 2.41 streaming)
  • Don’t read whole rows via ProcessContext - it deserializes everything; use @FieldAccess instead
  • Use relational transforms: beam.Select, beam.GroupBy

Scaling up pandas with the Beam DataFrame API

The session was presented by Brian Hulette (software engineer at Google, Apache Beam committer).

  • Nice introduction to Pandas
  • DataframeTransform and DeferredDataFrame

Dataframe transform

  • Dataframe code → Expression tree → Beam pipeline
  • Compliance with Pandas is limited for now
  • 14% of Pandas operations are order-sensitive (hard to distribute)
  • Pandas windowing operations transformed into Beam windowing would be a real differentiator in the future

Improving Beam-Dataflow Pipelines For Text Data Processing

The session was presented by Sayak Paul and Nilabhra Roy Chowdhury (ML engineers at Carted).

  • Use variable sequence lengths for sentence encoder model
  • Sort data before batching
  • Separate tokenization and encoding steps to achieve full parallelism
  • More details in the blog post

Pipeline optimization at Carted

Summary

I would like to thank the organizers and speakers for the Beam Summit conference. The project needs such events to share knowledge about how leading organizations use Apache Beam, how to apply advanced data processing techniques and how the runners execute the pipelines.

Below you can find a few takeaways from the sessions:

  • Companies like Spotify, Palo Alto Networks or Twitter develop custom, fully managed, declarative layers on top of Apache Beam to run thousands of data pipelines without coding and excessive operations.
  • Streaming pipelines are sexy but much more expensive (and complex) than batch pipelines.
  • Use existing tools like PerfKit for performance/cost evaluation. Check Apache Beam performance metrics to compare different runners (for example if you want to migrate from JDK 1.8 to JDK 11).
  • Understand the framework and the runner internals; unfortunately it’s necessary for troubleshooting. Be aware that the Dataflow batch and streaming engines are developed by different engineering teams.
  • Cloud resources aren’t infinite, be prepared and define disaster recovery scenarios (fallback to more common machine types).
  • The future is bright: Beam SQL, integration with ML frameworks, column-oriented vectorized execution, new runners
