Last week I attended Apache Big Data Europe held in Sevilla, Spain. The event focuses on big data projects under the Apache Foundation umbrella. Below you can find my overall impression of the conference and notes from several interesting sessions. The notes are presented as short checklists; where some aspect was particularly appealing, I added a reference to supplementary materials.

Key takeaways

  • At Allegro we’re on track with our clickstream ingestion platform. Apache Kafka, Apache Avro, Apache Parquet, Apache Spark, Apache Hive and last but not least Apache Druid are key players for us, all hosted under Apache Foundation!
  • Apache Ignite might solve many performance issues in Spark jobs (shared RDD) but also in MR jobs (in-memory MR, HDFS cache). Has to be verified during the next Allegro internal hackathon, for sure.
  • Integration between Apache Hive and Apache Druid looks really promising. Both tools are very important for the Hadoop ecosystem, and they complement each other quite well.
  • Apache Calcite seems to be an important element in the Hadoop ecosystem. I hope that the gap to mature RDBMS optimizers will be somehow filled. It would also be great to see Spark Catalyst and Apache Calcite cooperate; keep your fingers crossed.
  • Stephan Ewen should improve Apache Flink keynotes if data Artisans want to compete with Databricks. Apache Flink architecture and overall design are awesome, FTW.
  • Queryable state in stream processing is quite an interesting idea to decrease latency in access to pre-aggregated data.
  • Apache Kudu and Apache Impala are at a dead end, IMHO. The concept of executing analytical queries (fast SQL on Impala) against a whole set of raw data (kept in Kudu) is unrealistic. Cloudera has mastered keeping their own technologies alive (e.g. Apache Flume).
  • Apache Gearpump from Intel has lost its momentum. I really liked the idea of a distributed streaming framework built on Akka.
  • I was really surprised that the CSV format is still heavily used and, what’s even worse, conference speakers still talk about it.
  • Unfortunately there was no session about Apache Kylin, so sad.

“Apache Gearpump next-gen streaming engine” by Karol Brejna, Huafeng Wang (Intel)

“An overview on optimization in Apache Hive: past, present, future” by Jesus Rodriguez (HortonWorks)

  • Goals - sub-second latency, petabyte scale, ANSI SQL - never ending story.
  • Metastore is often a bottleneck (DataNucleus ORM). See also:
  • Metastore on HBase (Hive 2.x, alpha).
  • Cost Based Optimizer (Apache Calcite), integrated from 0.14, enabled by default from 2.0.
  • Optimizations: push down projections, push down filtering, join reordering (bushy joins), propagate projections, propagate filtering and more.
  • New feature: materialized views. Cons: up-to-date statistics, optimizer could use view instead of table. See more: HIVE-10459
  • Roadmap: optimizations based on CPU/MEM/IO costs.
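
To make the push-down bullets above a bit more concrete, here is a minimal toy sketch of the idea. It is not Hive or Calcite code: the table, the column names and the helper functions (scan, naive_plan, pushed_down_plan) are all made up for illustration. It only shows that applying the filter and the projection at the scan returns the same result while moving far fewer values through the plan.

```python
# Toy illustration of filter and projection push-down (not Hive or Calcite code).
# The table, column names and helper functions are made up for this example.

ROWS = [
    {"user": "a", "country": "PL", "clicks": 3, "payload": "x" * 100},
    {"user": "b", "country": "ES", "clicks": 5, "payload": "y" * 100},
    {"user": "c", "country": "PL", "clicks": 7, "payload": "z" * 100},
]


def scan(rows, columns=None, predicate=None):
    """Read rows, optionally applying a filter and a projection at the source.

    Returns the rows and the number of cell values that left the scan.
    """
    out = []
    cells = 0
    for row in rows:
        if predicate is not None and not predicate(row):
            continue
        picked = row if columns is None else {c: row[c] for c in columns}
        cells += len(picked)
        out.append(picked)
    return out, cells


def naive_plan(rows):
    # Scan everything first, then filter, then project.
    scanned, cells = scan(rows)
    filtered = [r for r in scanned if r["country"] == "PL"]
    return [{"user": r["user"], "clicks": r["clicks"]} for r in filtered], cells


def pushed_down_plan(rows):
    # The same query with the filter and the projection pushed into the scan.
    return scan(rows, columns=["user", "clicks"],
                predicate=lambda r: r["country"] == "PL")


naive_result, naive_cells = naive_plan(ROWS)
pushed_result, pushed_cells = pushed_down_plan(ROWS)
```

Both plans answer the same query; the only difference is how many values the scan hands over to the rest of the plan. A real optimizer makes this decision based on rules and cost estimates rather than by hand.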

“Distributed in-database machine learning with Apache MADlib” by Roman Shaposhnik (Pivotal)

“Interactive analytics at scale in Hive using Druid” by Jesus Rodriguez (HortonWorks)

  • It works, at least on Jesus’ laptop with an unreleased Hive version.
  • Druid is used as a library (more or less).
  • Druid indexing service isn’t used for indexing (regular Hive MR is used instead).
  • Druid broker is used for querying but there are plans to bypass the broker and access historical/real-time/indexing nodes directly. Right now the broker might be a bottleneck.
  • Apache Calcite is a key player in the integration.
  • Pros: schema discovery.
  • Cons: dims/metrics are inferred, right now there is no way to specify all important index details (e.g. time granularities).
  • Roadmap: push down more things into Druid for better query performance. See more:

“Hadoop at Uber” by Mayank Bansal (Uber)

  • Fast pace of changes, respect!
  • Shared cluster for batch jobs (YARN) and realtime jobs (Mesos) -> better resource utilization.
  • Apache Myriad (YARN on Mesos): static allocation, no global quotas, no global priorities and many more limitations and problems.
  • Unified Scheduler - just the name without any details yet.

“Spark Performance” by Tim Ellison (IBM)

“Apache Calcite and Apache Geode” by Christian Tzolov (Pivotal)

  • Apache Geode (AKA Gemfire) - distributed hashmap, consistent, transactional, partitioned, replicated, etc.
  • PDX serialization, on field level, type registry.
  • Nested regions.
  • Embeddable.
  • Object Query Language (OQL) ~ SQL.
  • Apache Calcite adapter (work in progress). The adapter might be implemented gradually (from in-memory enumerable to advanced pushdowns/optimizations and bindable generated code).
  • Linq4j ported from .NET.

“Data processing pipeline at Trivago” by Clemens Valiente (Trivago)

  • Separated data centers, REST collectors with HDD fallback, Apache Kafka, Camus, CSV.
  • Hive MR jobs prepare aggregates and data subsets for Apache Impala, Apache Oozie used as scheduler.
  • Problems with memory leaks in Apache Impala.
  • R/Shiny connected to Impala for analytical purposes.
  • Roadmap: Kafka Streams, Impala + Kudu, Kylin + HBase.
  • Interesting concept: direct access to Kafka Streams state (queryable RocksDB).
  • BigTop - way to build packages or set up big data servers and tools locally.
  • BigPetStore - generates synthetic data + sample stats calculation + sample recommendation (collaborative filtering).
  • MR, Spark, Flink implementations - nice method to learn Flink if you already know Spark.
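
The queryable state idea mentioned in the Trivago notes (direct access to Kafka Streams state) can be sketched roughly as below. This is a toy model and not the Kafka Streams API: ClickCounter and its methods are hypothetical names, and a plain dict stands in for the local RocksDB store. The point is only that the stream processor answers reads from its own pre-aggregated state instead of exporting it to another database first.

```python
# Toy sketch of queryable state (not the Kafka Streams API).
# ClickCounter and its methods are hypothetical names used for illustration.
from collections import defaultdict


class ClickCounter:
    """Stream processor that keeps its running aggregate in a local store."""

    def __init__(self):
        # A dict stands in for the local state store (RocksDB in the notes).
        self._store = defaultdict(int)

    def process(self, event):
        # Update the pre-aggregated state for every incoming event.
        self._store[event["page"]] += 1

    def query(self, page):
        # Serve reads straight from the processor's own state,
        # without first writing the aggregate to an external database.
        return self._store.get(page, 0)


counter = ClickCounter()
for event in [{"page": "/home"}, {"page": "/cart"}, {"page": "/home"}]:
    counter.process(event)
```

Calling counter.query("/home") reads the current count directly from the processor. In a distributed setup the state is partitioned across instances, so a real implementation also has to route the query to the instance that owns the key.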

“Introduction to TensorFlow” by Gemma Parreno

  • Global finalist of NASA Space App Challenge, congrats!
  • Fascinating session, but I was totally lost - too much science :-(

“Shared Memory Layer and Faster SQL for Spark Applications” by Dmitriy Setrakyan (GridGain)

“Apache CouchDB” by Jan Lehnardt (Neighbourhoodie Software)

  • Master-master key-value store.
  • HTTP as protocol.
  • JSON as storage format.
  • Simplified vector clock for conflict handling (document hashes, consistent across clusters).
  • Entertaining part about handling time in distributed systems, highly recommended.
  • Conflict-Free Replicated JSON Datatype was mentioned by Jan during the talk, interesting.
  • Roadmap: HTTP2, pluggable storage format, improved protocol for high latency networks.
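
The conflict handling bullet above can be illustrated with a small sketch. It is a simplification and not CouchDB's implementation: I assume revision ids of the form generation-hash, and make_rev and pick_winner are hypothetical helpers. Because the winner depends only on the revision ids, every node picks the same one regardless of the order in which the conflicting revisions arrived.

```python
# Toy sketch of deterministic conflict resolution with content hashes.
# This is a simplification for illustration, not CouchDB's implementation;
# make_rev and pick_winner are hypothetical helpers.
import hashlib
import json


def make_rev(generation, doc):
    """Build a revision id of the form '<generation>-<hash of the content>'."""
    body = json.dumps(doc, sort_keys=True).encode("utf-8")
    return f"{generation}-{hashlib.sha256(body).hexdigest()}"


def pick_winner(revs):
    """Pick the same winner on every node, whatever order revisions arrived in.

    Higher generation wins; ties are broken by comparing the hash part.
    """
    def key(rev):
        generation, digest = rev.split("-", 1)
        return int(generation), digest

    return max(revs, key=key)


# Two replicas edit the same document concurrently (both produce generation 2).
rev_a = make_rev(2, {"name": "Sevilla", "visits": 1})
rev_b = make_rev(2, {"name": "Sevilla", "visits": 2})
rev_c = make_rev(3, {"name": "Sevilla", "visits": 3})
```

pick_winner([rev_a, rev_b]) and pick_winner([rev_b, rev_a]) return the same revision, so replicas converge without coordination. The losing revision is not meant to be thrown away in this model; the application can still inspect it and merge by hand.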

“Java memory leaks in modular environment” by Mark Thomas (Pivotal)

  • Do you remember “OutOfMemoryError: PermGen space” in Tomcat? It’s mostly not Tomcat’s fault.
  • You should always set -XX:MaxMetaspaceSize in production systems.
  • Excellent memory leaks analysis live demo using YourKit profiler (leaks in awt, java2d, rmi, xml).

“Children and the art of coding” by Sebastien Blanc (RedHat)

  • The best and most entertaining session at the conference, IMHO! If you are a happy parent, you should watch Sebastien’s session (unfortunately sessions were not recorded, AFAIK).
  • Logo, Scratch, Groovy, Arduino and Makey Makey