Unfortunately, many data engineers aren’t aware of TDD or its benefits. They change pipeline code, build an assembly, deploy the assembly to the cluster, run the pipeline and manually verify the results using SQL. How much time does that take? I would guess at least 30 minutes.
What if you could reduce the development cycle from 30 minutes to 10 seconds and deliver code of higher quality?
Today I will develop a batch data pipeline from scratch using the TDD approach: a data pipeline that calculates statistics from the toll booth entries you pass on highways, bridges, or tunnels.
I drew inspiration for the design of this data pipeline from the Azure documentation titled Build an IoT solution by using Stream Analytics. Instead of relying on SQL, I chose Apache Beam, Spotify Scio and a set of custom-built utilities.
Command line arguments constitute the data pipeline API. Design this API and implement a job end-to-end test first. For calculating toll booth statistics, the data pipeline takes 4 parameters:
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers
import com.spotify.scio.testing.JobTest
import org.mkuthan.streamprocessing.test.scio.JobTestScioContext
class TollBoothEntryStatsJobTest extends AnyFlatSpec with Matchers with JobTestScioContext {
"Toll job" should "run in the batch mode" in {
JobTest[TollBoothEntryStatsJob.type]
.args(
"--effectiveDate=2014-09-10",
"--entryTable=toll.entry",
"--entryStatsHourlyTable=toll.entry_stats_hourly",
"--entryStatsDailyTable=toll.entry_stats_daily"
)
.run()
}
}
Run the test:
sbt> testOnly *TollBoothEntryStatsJobTest
[error] TollBoothEntryStatsJobTest.scala:12:13: not found: value TollBoothEntryStatsJob
[error] JobTest[TollBoothEntryStatsJob.type]
[error] ^
[error] one error found
[error] (Test / compileIncremental) Compilation failed
The test fails because there is no TollBoothEntryStatsJob class with the data pipeline implementation yet.
Implement the skeleton of the job; it does literally nothing but fulfill the command line parameters contract.
import com.spotify.scio.ContextAndArgs
import org.joda.time.LocalDate
object TollBoothEntryStatsJob {
def main(mainArgs: Array[String]): Unit = {
val (sc, args) = ContextAndArgs(mainArgs)
val effectiveDate = LocalDate.parse(args.required("effectiveDate"))
val entryTable = args.required("entryTable")
val entryStatsHourlyTable = args.required("entryStatsHourlyTable")
val entryStatsDailyTable = args.required("entryStatsDailyTable")
sc.run()
}
}
Run the test again:
sbt> testOnly *TollBoothEntryStatsJobTest
[info] TollBoothEntryStatsJobTest:
[info] Toll job
[info] - should run in the batch mode
[info] Run completed in 3 seconds, 877 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
End-to-end test using Apache Beam direct runner executed in 3.9 seconds 🚀
TollBoothEntryStatsJob needs input data, so define a type for toll booth entry records:
import com.spotify.scio.bigquery.types.BigQueryType
import org.joda.time.Instant
object TollBoothEntry {
@BigQueryType.toTable
final case class Record(
id: String,
entry_time: Instant,
license_plate: String,
toll: BigDecimal,
)
}
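As an aside, the BigQueryType.toTable annotation derives the BigQuery schema and row converters from the case class. A small sketch of what the generated companion object exposes, based on the scio-bigquery API (treat the exact member names as an assumption):
// Members generated by @BigQueryType.toTable on the Record companion (sketch):
val derivedSchema = TollBoothEntry.Record.schema       // BigQuery TableSchema derived from the case class
val toRow = TollBoothEntry.Record.toTableRow _         // TollBoothEntry.Record => TableRow
val fromRow = TollBoothEntry.Record.fromTableRow _     // TableRow => TollBoothEntry.Record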
Test fixture with an anyTollBoothEntryRecord prototype:
trait TollBoothEntryFixture {
final val anyTollBoothEntryRecord = TollBoothEntry.Record(
id = "1",
entry_time = Instant.parse("2014-09-10T12:01:00Z"),
license_plate = "JNB 7001",
toll = BigDecimal(7),
)
}
Use anyTollBoothEntryRecord as an input for the job. Keep it simple; it’s a happy path scenario with a single vehicle crossing the toll booth.
"Toll job" should "run in the batch mode" in {
JobTest[TollBoothEntryStatsJob.type]
.args(
"--effectiveDate=2014-09-10",
"--entryTable=toll.entry",
"--entryStatsHourlyTable=toll.entry_stats_hourly",
"--entryStatsDailyTable=toll.entry_stats_daily"
)
.input(
CustomIO[TollBoothEntry.Record]("entry-table-id"),
Seq(anyTollBoothEntryRecord)
)
.run()
}
Run the test:
sbt> testOnly *TollBoothEntryStatsJobTest
[info] TollBoothEntryStatsJobTest:
[info] Toll job
[info] - should run in the batch mode *** FAILED ***
[info] java.lang.IllegalArgumentException: requirement failed:
Unmatched test input: CustomIO(entry-table-id)
[info] Run completed in 3 seconds, 891 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
[info] *** 1 TEST FAILED ***
The test fails because TollBoothEntryStatsJob doesn’t use the entry-table-id input.
Implement reading toll booth entry records from BigQuery, using a row restriction to fetch only a single partition of data.
import org.mkuthan.streamprocessing.infrastructure._
import org.mkuthan.streamprocessing.infrastructure.bigquery.BigQueryTable
import org.mkuthan.streamprocessing.infrastructure.bigquery.RowRestriction
import org.mkuthan.streamprocessing.infrastructure.bigquery.StorageReadConfiguration
import org.mkuthan.streamprocessing.infrastructure.common.IoIdentifier
def main(mainArgs: Array[String]): Unit = {
val (sc, args) = ContextAndArgs(mainArgs)
val effectiveDate = LocalDate.parse(args.required("effectiveDate"))
val entryTable = args.required("entryTable")
val entryStatsHourlyTable = args.required("entryStatsHourlyTable")
val entryStatsDailyTable = args.required("entryStatsDailyTable")
val entryRecords = sc.readFromBigQuery(
IoIdentifier[TollBoothEntry.Record]("entry-table-id"),
BigQueryTable[TollBoothEntry.Record](entryTable),
StorageReadConfiguration().withRowRestriction(
RowRestriction.TimestampColumnRestriction("entry_time", effectiveDate)
)
)
sc.run()
}
Run the test again:
sbt> testOnly *TollBoothEntryStatsJobTest
[info] TollBoothEntryStatsJobTest:
[info] Toll job
[info] - should run in the batch mode
[info] Run completed in 4 seconds, 497 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
TollBoothEntryStatsJob produces results, so define a type for toll booth stats records:
import com.spotify.scio.bigquery.types.BigQueryType
import org.joda.time.Instant
object TollBoothStats {
@BigQueryType.toTable
final case class Record(
created_at: Instant,
id: String,
count: Int,
total_toll: BigDecimal,
first_entry_time: Instant,
last_entry_time: Instant
)
}
Test fixture with an anyTollBoothStatsRecord prototype:
trait TollBoothStatsFixture {
final val anyTollBoothStatsRecord = TollBoothStats.Record(
created_at = Instant.EPOCH,
id = "1",
total_toll = BigDecimal(7),
count = 1,
first_entry_time = Instant.parse("2014-09-10T12:01:00.000Z"),
last_entry_time = Instant.parse("2014-09-10T12:01:00.000Z")
)
}
Use anyTollBoothStatsRecord to verify the output of the job. Again, keep the happy path scenario simple, with a single vehicle crossing the toll booth.
"Toll job" should "run in the batch mode" in {
JobTest[TollBoothEntryStatsJob.type]
.args(
"--effectiveDate=2014-09-10",
"--entryTable=toll.entry",
"--entryStatsHourlyTable=toll.entry_stats_hourly",
"--entryStatsDailyTable=toll.entry_stats_daily"
)
.input(
CustomIO[TollBoothEntry.Record]("entry-table-id"),
Seq(anyTollBoothEntryRecord)
)
.output(CustomIO[TollBoothStats.Record]("entry-stats-hourly-table-id")) { results =>
val endOfHourlyWindow = Instant.parse("2014-09-10T12:59:59.999Z")
results should containElements(
anyTollBoothStatsRecord.copy(created_at = endOfHourlyWindow)
)
}
.output(CustomIO[TollBoothStats.Record]("entry-stats-daily-table-id")) { results =>
val endOfDailyWindow = Instant.parse("2014-09-10T23:59:59.999Z")
results should containElements(
anyTollBoothStatsRecord.copy(created_at = endOfDailyWindow)
)
}
.run()
}
Run the test:
sbt> testOnly *TollBoothEntryStatsJobTest
[info] TollBoothEntryStatsJobTest:
[info] Toll job
[info] - should run in the batch mode *** FAILED ***
[info] java.lang.IllegalArgumentException: requirement failed:
Unmatched test output:
CustomIO(entry-stats-hourly-table-id),
CustomIO(entry-stats-daily-table-id)
[info] Run completed in 8 seconds, 704 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
[info] *** 1 TEST FAILED ***
The test fails because TollBoothEntryStatsJob doesn’t produce results into the entry-stats-hourly-table-id and entry-stats-daily-table-id outputs.
Write the toll booth stats records to BigQuery tables. Don’t implement the toll booth statistics at this stage; just reuse a test fixture as the job result.
import org.mkuthan.streamprocessing.infrastructure._
import org.mkuthan.streamprocessing.infrastructure.bigquery.BigQueryPartition
import org.mkuthan.streamprocessing.infrastructure.bigquery.BigQueryTable
import org.mkuthan.streamprocessing.infrastructure.bigquery.RowRestriction
import org.mkuthan.streamprocessing.infrastructure.bigquery.StorageReadConfiguration
import org.mkuthan.streamprocessing.infrastructure.common.IoIdentifier
def main(mainArgs: Array[String]): Unit = {
val (sc, args) = ContextAndArgs(mainArgs)
val effectiveDate = LocalDate.parse(args.required("effectiveDate"))
val entryTable = args.required("entryTable")
val entryStatsHourlyTable = args.required("entryStatsHourlyTable")
val entryStatsDailyTable = args.required("entryStatsDailyTable")
val entryRecords = sc.readFromBigQuery(
IoIdentifier[TollBoothEntry.Record]("entry-table-id"),
BigQueryTable[TollBoothEntry.Record](entryTable),
StorageReadConfiguration().withRowRestriction(
RowRestriction.TimestampColumnRestriction("entry_time", effectiveDate)
)
)
// TODO: calculate toll booth entry stats
val endOfHourlyWindow = Instant.parse("2014-09-10T12:59:59.999Z")
sc
.parallelize(Seq(anyTollBoothStatsRecord.copy(created_at = endOfHourlyWindow)))
.writeBoundedToBigQuery(
IoIdentifier("entry-stats-hourly-table-id"),
BigQueryPartition.daily(entryStatsHourlyTable, effectiveDate)
)
val endOfDailyWindow = Instant.parse("2014-09-10T23:59:59.999Z")
sc
.parallelize(Seq(anyTollBoothStatsRecord.copy(created_at = endOfDailyWindow)))
.writeBoundedToBigQuery(
IoIdentifier("entry-stats-daily-table-id"),
BigQueryPartition.daily(entryStatsDailyTable, effectiveDate)
)
sc.run()
}
Run the test again:
sbt> testOnly *TollBoothEntryStatsJobTest
[info] TollBoothEntryStatsJobTest:
[info] Toll job
[info] - should run in the batch mode
[info] Run completed in 6 seconds, 770 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
Define the domain type TollBoothEntry using rich types:
final case class TollBoothId(id: String) {
require(!id.isEmpty, "Toll booth id is empty")
}
final case class LicensePlate(number: String) {
require(!number.isEmpty, "License plate number is empty")
}
final case class TollBoothEntry(
id: TollBoothId,
entryTime: Instant,
licensePlate: LicensePlate,
toll: BigDecimal
) {
require(toll >= 0, s"Toll is negative: $toll")
}
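These require checks fail fast at construction time, which is exactly what the exception test further down asserts. A tiny illustration, assuming the types above are in scope:
import scala.util.Try

// Constructing an invalid value throws immediately with the require message:
val invalidId = Try(TollBoothId(""))
// invalidId == Failure(java.lang.IllegalArgumentException: requirement failed: Toll booth id is empty)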
Test fixture with an anyTollBoothEntry prototype:
trait TollBoothEntryFixture {
final val anyTollBoothEntry = TollBoothEntry(
id = TollBoothId("1"),
entryTime = Instant.parse("2014-09-10T12:01:00Z"),
toll = BigDecimal(7),
licensePlate = LicensePlate("JNB 7001")
)
}
At this moment don’t touch TollBoothEntryStatsJobTest, but define a separate test for the mapping between TollBoothEntry.Record and TollBoothEntry.
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers
import org.mkuthan.streamprocessing.test.scio.syntax._
import org.mkuthan.streamprocessing.test.scio.TestScioContext
class TollBoothEntryTest extends AnyFlatSpec with Matchers
with TestScioContext
with TollBoothEntryFixture {
import TollBoothEntry._
behavior of "TollBoothEntry"
it should "decode valid record into TollBoothEntry" in runWithScioContext { sc =>
val inputs = boundedTestCollectionOf[TollBoothEntry.Record]
.addElementsAtMinimumTime(anyTollBoothEntryRecord)
.advanceWatermarkToInfinity()
val results = decodeRecord(sc.testBounded(inputs))
results.withTimestamp should containElementsAtTime(
anyTollBoothEntry.entryTime,
anyTollBoothEntry
)
}
it should "throw an exception for invalid record" in {
val thrown = the[RuntimeException] thrownBy {
runWithScioContext { sc =>
val invalidRecord = anyTollBoothEntryRecord.copy(id = "")
val inputs = boundedTestCollectionOf[TollBoothEntry.Record]
.addElementsAtMinimumTime(invalidRecord)
.advanceWatermarkToInfinity()
decodeRecord(sc.testBounded(inputs))
}
}
thrown.getMessage should include("Toll booth id is empty")
}
}
Run the test:
sbt:tollDomain> testOnly *TollBoothEntryTest
[error] TollBoothEntryTest.scala:59:19: not found: value decodeRecord
[error] val results = decodeRecord(sc.testBounded(inputs))
[error] ^
[error] TollBoothEntryTest.scala:73:9: not found: value decodeRecord
[error] decodeRecord(sc.testBounded(inputs))
[error] ^
Implement the decodeRecord function to fulfill the contract defined in the tests:
object TollBoothEntry {
def decodeRecord(input: SCollection[TollBoothEntry.Record]): SCollection[TollBoothEntry] =
input.transform { in =>
in
.map(record => fromRecord(record))
.timestampBy(boothEntry => boothEntry.entryTime)
}
private def fromRecord(record: TollBoothEntry.Record) =
TollBoothEntry(
id = TollBoothId(record.id),
entryTime = record.entry_time,
licensePlate = LicensePlate(record.license_plate),
toll = record.toll
)
}
Run the test again:
sbt:tollDomain> testOnly *TollBoothEntryTest
[info] TollBoothEntryTest:
[info] TollBoothEntry
[info] - should decode valid record into TollBoothEntry
[info] - should throw an exception for invalid record
[info] Run completed in 8 seconds, 362 milliseconds.
[info] Total number of tests run: 2
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 2, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
It’s time to implement the core domain logic. Define the domain type TollBoothStats using rich types:
final case class TollBoothStats(
id: TollBoothId,
count: Int,
totalToll: BigDecimal,
firstEntryTime: Instant,
lastEntryTime: Instant
) {
private def before(other: TollBoothStats): Boolean =
firstEntryTime.isBefore(other.firstEntryTime)
private def after(other: TollBoothStats): Boolean =
lastEntryTime.isAfter(other.lastEntryTime)
}
Test fixture with an anyTollBoothStats prototype:
trait TollBoothStatsFixture {
final val anyTollBoothStats = TollBoothStats(
id = TollBoothId("1"),
totalToll = BigDecimal(7),
count = 1,
firstEntryTime = Instant.parse("2014-09-10T12:01:00.000Z"),
lastEntryTime = Instant.parse("2014-09-10T12:01:00.000Z")
)
}
At this moment don’t touch TollBoothEntryStatsJobTest, but define a separate test for calculating TollBoothStats from TollBoothEntry.
class TollBoothStatsTest extends AnyFlatSpec with Matchers
with TestScioContext
with TollBoothEntryFixture
with TollBoothStatsFixture {
import TollBoothStats._
private val FiveMinutes = Duration.standardMinutes(5)
private val DefaultWindowOptions = WindowOptions()
behavior of "TollBoothStats"
it should "calculate statistics in fixed window" in runWithScioContext { sc =>
val tollBoothId1 = TollBoothId("1")
val tollBoothId2 = TollBoothId("2")
val entry1 = anyTollBoothEntry.copy(
id = tollBoothId1,
entryTime = Instant.parse("2014-09-10T12:01:00.000Z"),
toll = BigDecimal(2)
)
val entry2 = anyTollBoothEntry.copy(
id = tollBoothId1,
entryTime = Instant.parse("2014-09-10T12:01:30.000Z"),
toll = BigDecimal(1)
)
val entry3 = anyTollBoothEntry.copy(
id = tollBoothId2,
entryTime = Instant.parse("2014-09-10T12:04:00.000Z"),
toll = BigDecimal(4)
)
val inputs = boundedTestCollectionOf[TollBoothEntry]
.addElementsAtTime(entry1.entryTime, entry1)
.addElementsAtTime(entry2.entryTime, entry2)
.addElementsAtTime(entry3.entryTime, entry3)
.advanceWatermarkToInfinity()
val results = calculateInFixedWindow(sc.testBounded(inputs), FiveMinutes, DefaultWindowOptions)
results.withTimestamp should inOnTimePane("2014-09-10T12:00:00Z", "2014-09-10T12:05:00Z") {
containElementsAtTime(
"2014-09-10T12:04:59.999Z",
anyTollBoothStats.copy(
id = tollBoothId1,
count = 2,
totalToll = BigDecimal(2 + 1),
firstEntryTime = entry1.entryTime,
lastEntryTime = entry2.entryTime
),
anyTollBoothStats.copy(
id = tollBoothId2,
count = 1,
totalToll = BigDecimal(4),
firstEntryTime = entry3.entryTime,
lastEntryTime = entry3.entryTime
)
)
}
}
it should "calculate statistics in fixed window for late entries" in runWithScioContext { sc =>
val onTimeEntry1 = anyTollBoothEntry.copy(
entryTime = Instant.parse("2014-09-10T12:01:00Z"),
toll = BigDecimal(2)
)
val onTimeEntry2 = anyTollBoothEntry.copy(
entryTime = Instant.parse("2014-09-10T12:02:00Z"),
toll = BigDecimal(3)
)
val lateEntry = anyTollBoothEntry.copy(
entryTime = Instant.parse("2014-09-10T12:03:00Z"),
toll = BigDecimal(1)
)
val inputs = unboundedTestCollectionOf[TollBoothEntry]
.addElementsAtTime(onTimeEntry1.entryTime, onTimeEntry1)
.addElementsAtTime(onTimeEntry2.entryTime, onTimeEntry2)
.advanceWatermarkTo("2014-09-10T12:05:00Z")
.addElementsAtTime(lateEntry.entryTime, lateEntry)
.advanceWatermarkToInfinity()
val windowOptions = WindowOptions(
allowedLateness = Duration.standardMinutes(2),
accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES
)
val results = calculateInFixedWindow(sc.testUnbounded(inputs), FiveMinutes, windowOptions)
val windowStart = "2014-09-10T12:00:00Z"
val windowEnd = "2014-09-10T12:05:00Z"
val recordTimestamp = Instant.parse("2014-09-10T12:04:59.999Z")
results.withTimestamp should inOnTimePane(windowStart, windowEnd) {
containElementsAtTime(
recordTimestamp,
anyTollBoothStats.copy(
count = 2,
totalToll = BigDecimal(2 + 3),
firstEntryTime = onTimeEntry1.entryTime,
lastEntryTime = onTimeEntry2.entryTime
)
)
}
results.withTimestamp should inLatePane(windowStart, windowEnd) {
containElementsAtTime(
recordTimestamp,
anyTollBoothStats.copy(
count = 1,
totalToll = BigDecimal(1),
firstEntryTime = lateEntry.entryTime,
lastEntryTime = lateEntry.entryTime
)
)
}
}
}
Look at Stream Processing – part 1 and Stream Processing – part 2 to learn how to test complex domain logic.
Run tests:
sbt:tollDomain> testOnly *TollBoothStatsTest
[error] TollBoothStatsTest.scala:56:19: not found: value calculateInFixedWindow
[error] val results = calculateInFixedWindow(sc.testBounded(inputs), FiveMinutes, DefaultWindowOptions)
[error] ^
[error] TollBoothStatsTest.scala:107:19: not found: value calculateInFixedWindow
[error] val results = calculateInFixedWindow(sc.testUnbounded(inputs), FiveMinutes, windowOptions)
[error]
Implement the calculateInFixedWindow function to fulfill the contract defined in the tests:
object TollBoothStats {
def calculateInFixedWindow(
input: SCollection[TollBoothEntry],
windowDuration: Duration,
windowOptions: WindowOptions
): SCollection[TollBoothStats] =
input.transform { in =>
in
.map(fromBoothEntry)
.sumByKeyInFixedWindow(windowDuration, windowOptions)
}
private def fromBoothEntry(boothEntry: TollBoothEntry) = TollBoothStats(
id = boothEntry.id,
count = 1,
totalToll = boothEntry.toll,
firstEntryTime = boothEntry.entryTime,
lastEntryTime = boothEntry.entryTime
)
implicit val TollBoothStatsSumByKey = SumByKey.create(
keyFn = _.id.id,
plusFn = (x, y) =>
x.copy(
count = x.count + y.count,
totalToll = x.totalToll + y.totalToll,
firstEntryTime = if (x.before(y)) x.firstEntryTime else y.firstEntryTime,
lastEntryTime = if (x.after(y)) x.lastEntryTime else y.lastEntryTime
)
)
}
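To see what the plusFn above does, here is a standalone sketch of merging two partial results for the same booth; the values mirror entry1 and entry2 from the fixed-window test:
import org.joda.time.Instant

// Two partial per-booth results (illustrative values, same semantics as plusFn):
val x = TollBoothStats(
  id = TollBoothId("1"),
  count = 1,
  totalToll = BigDecimal(2),
  firstEntryTime = Instant.parse("2014-09-10T12:01:00Z"),
  lastEntryTime = Instant.parse("2014-09-10T12:01:00Z")
)
val y = TollBoothStats(
  id = TollBoothId("1"),
  count = 1,
  totalToll = BigDecimal(1),
  firstEntryTime = Instant.parse("2014-09-10T12:01:30Z"),
  lastEntryTime = Instant.parse("2014-09-10T12:01:30Z")
)

// Merged result: count = 2, totalToll = 3, entry times span both entries.
val merged = x.copy(
  count = x.count + y.count,
  totalToll = x.totalToll + y.totalToll,
  firstEntryTime = if (x.firstEntryTime.isBefore(y.firstEntryTime)) x.firstEntryTime else y.firstEntryTime,
  lastEntryTime = if (x.lastEntryTime.isAfter(y.lastEntryTime)) x.lastEntryTime else y.lastEntryTime
)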
Run tests again:
sbt> testOnly *TollBoothStatsTest
[info] TollBoothStatsTest:
[info] TollBoothStats
[info] - should calculate statistics in fixed window
[info] - should calculate statistics in fixed window for late entries
[info] Run completed in 7 seconds, 553 milliseconds.
[info] Total number of tests run: 2
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 2, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
Well done on implementing and testing intricate domain logic without a single data pipeline deployment. Consider this scenario: if you had to repeat the testing process 50 times before achieving the final implementation, it would consume 1500 minutes, or 25 hours, of your time with a 30-minute development cycle. With a 7.5-second development cycle it would take only 375 seconds, or 6.25 minutes.
Dear data engineer, can you now appreciate the significance of Test-Driven Development (TDD) in your work?
We’re almost done; add a test scenario for encoding TollBoothStats into TollBoothStats.Record.
class TollBoothStatsTest extends AnyFlatSpec with Matchers
with TestScioContext
with TollBoothEntryFixture
with TollBoothStatsFixture {
import TollBoothStats._
behavior of "TollBoothStats"
(...)
it should "encode into record" in runWithScioContext { sc =>
val createdAt = Instant.parse("2014-09-10T12:04:59.999Z")
val inputs = boundedTestCollectionOf[TollBoothStats]
.addElementsAtTime(createdAt, anyTollBoothStats)
.advanceWatermarkToInfinity()
val results = encodeRecord(sc.testBounded(inputs))
results should containElements(anyTollBoothStatsRecord.copy(created_at = createdAt))
}
}
Run the test:
sbt> testOnly *TollBoothStatsTest
[error] TollBoothStatsTest.scala:144:19: not found: value encodeRecord
[error] val results = encodeRecord(sc.testBounded(inputs))
[error]
Implement the encodeRecord function to fulfill the contract defined in the test:
object TollBoothStats {
def encodeRecord(input: SCollection[TollBoothStats]): SCollection[TollBoothStats.Record] =
input.mapWithTimestamp { case (record, timestamp) =>
Record(
created_at = timestamp,
id = record.id.id,
count = record.count,
total_toll = record.totalToll,
first_entry_time = record.firstEntryTime,
last_entry_time = record.lastEntryTime
)
}
}
Run the test again:
sbt> testOnly *TollBoothStatsTest
[info] TollBoothStatsTest:
[info] TollBoothStats
[info] - should calculate statistics in fixed window
[info] - should calculate statistics in fixed window for late entries
[info] - should encode into record
[info] Run completed in 6 seconds, 198 milliseconds.
[info] Total number of tests run: 3
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 3, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
Now all components are ready to finish the data pipeline without breaking the job end-to-end test.
Note that TollBoothEntryStatsJob has little responsibility of its own; it delegates heavily to TollBoothEntry and TollBoothStats for the actual domain logic.
import org.apache.beam.sdk.transforms.windowing.AfterWatermark
import org.apache.beam.sdk.transforms.windowing.Repeatedly
import org.apache.beam.sdk.transforms.windowing.Window
import org.apache.beam.sdk.values.WindowingStrategy.AccumulationMode
import com.spotify.scio.values.WindowOptions
import com.spotify.scio.ContextAndArgs
import org.joda.time.Duration
import org.joda.time.LocalDate
import org.mkuthan.streamprocessing.infrastructure._
import org.mkuthan.streamprocessing.infrastructure.bigquery.BigQueryPartition
import org.mkuthan.streamprocessing.infrastructure.bigquery.BigQueryTable
import org.mkuthan.streamprocessing.infrastructure.bigquery.RowRestriction
import org.mkuthan.streamprocessing.infrastructure.bigquery.StorageReadConfiguration
import org.mkuthan.streamprocessing.infrastructure.common.IoIdentifier
object TollBoothEntryStatsJob {
private val OneHour = Duration.standardHours(1)
private val OneDay = Duration.standardDays(1)
private val DefaultWindowOptions = WindowOptions(
trigger = Repeatedly.forever(AfterWatermark.pastEndOfWindow()),
allowedLateness = Duration.ZERO,
accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES,
onTimeBehavior = Window.OnTimeBehavior.FIRE_IF_NON_EMPTY
)
def main(mainArgs: Array[String]): Unit = {
val (sc, args) = ContextAndArgs(mainArgs)
val effectiveDate = LocalDate.parse(args.required("effectiveDate"))
val entryTable = args.required("entryTable")
val entryStatsHourlyTable = args.required("entryStatsHourlyTable")
val entryStatsDailyTable = args.required("entryStatsDailyTable")
val entryRecords = sc.readFromBigQuery(
IoIdentifier[TollBoothEntry.Record]("entry-table-id"),
BigQueryTable[TollBoothEntry.Record](entryTable),
StorageReadConfiguration().withRowRestriction(
RowRestriction.TimestampColumnRestriction("entry_time", effectiveDate)
)
)
val entries = TollBoothEntry.decodeRecord(entryRecords)
val tollBoothStatsHourly = TollBoothStats
.calculateInFixedWindow(entries, OneHour, DefaultWindowOptions)
TollBoothStats
.encodeRecord(tollBoothStatsHourly)
.writeBoundedToBigQuery(
IoIdentifier[TollBoothStats.Record]("entry-stats-hourly-table-id"),
BigQueryPartition.daily(entryStatsHourlyTable, effectiveDate)
)
val tollBoothStatsDaily = TollBoothStats
.calculateInFixedWindow(entries, OneDay, DefaultWindowOptions)
TollBoothStats
.encodeRecord(tollBoothStatsDaily)
.writeBoundedToBigQuery(
IoIdentifier[TollBoothStats.Record]("entry-stats-daily-table-id"),
BigQueryPartition.daily(entryStatsDailyTable, effectiveDate)
)
sc.run()
}
}
Let’s revisit the end-to-end test. It remains unchanged, but I’ve included it here for your reference:
import com.spotify.scio.io.CustomIO
import com.spotify.scio.testing.JobTest
import org.joda.time.Instant
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers
import org.mkuthan.streamprocessing.test.scio.JobTestScioContext
class TollBoothEntryStatsJobTest extends AnyFlatSpec with Matchers with JobTestScioContext
with TollBoothEntryFixture with TollBoothStatsFixture {
"Toll job" should "run in the batch mode" in {
JobTest[TollBoothEntryStatsJob.type]
.args(
"--effectiveDate=2014-09-10",
"--entryTable=toll.entry",
"--entryStatsHourlyTable=toll.entry_stats_hourly",
"--entryStatsDailyTable=toll.entry_stats_daily"
)
.input(
CustomIO[TollBoothEntry.Record]("entry-table-id"),
Seq(anyTollBoothEntryRecord)
)
.output(CustomIO[TollBoothStats.Record]("entry-stats-hourly-table-id")) { results =>
val endOfHourlyWindow = Instant.parse("2014-09-10T12:59:59.999Z")
results should containElements(
anyTollBoothStatsRecord.copy(created_at = endOfHourlyWindow)
)
}
.output(CustomIO[TollBoothStats.Record]("entry-stats-daily-table-id")) { results =>
val endOfDailyWindow = Instant.parse("2014-09-10T23:59:59.999Z")
results should containElements(
anyTollBoothStatsRecord.copy(created_at = endOfDailyWindow)
)
}
.run()
}
}
Run the test to verify data pipeline correctness:
sbt> testOnly *TollBoothEntryStatsJobTest
[info] TollBoothEntryStatsJobTest:
[info] Toll job
[info] - should run in the batch mode
[info] Run completed in 8 seconds, 704 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
Beautiful 💖, it’s now the moment to embark on the first deployment with the Dataflow runner, but that’s a tale for the upcoming blog post.
I created a batch data pipeline for calculating toll booth statistics using test-driven development.
By adhering to these principles and practices, you’ve established a structured and reliable methodology for developing your toll booth statistics data pipeline. This approach will not only enhance the quality and robustness of your pipeline but also make it easier to maintain and extend in the future.
In this blog post, I didn’t explain code sample details. It’s your homework to study my playground repository for unified stream and batch processing, available at https://github.com/mkuthan/stream-processing/. There, you’ll find code examples, detailed implementations, and further insights into the toll booth related data pipeline discussed in this post. Feel free to explore and experiment with the code to enhance your understanding of TDD for data engineering.
Below you can find the lambda architecture of the clickstream data enrichment system I developed around 2017. If you have never heard about lambda architecture, see Nathan Marz’s article How to beat the CAP theorem.
graph LR
subgraph speed-layer
kafka1[Kafka]-->kafka-streams[Kafka Streams]
kafka-streams-->kafka2[Kafka]
end
subgraph batch-layer
hdfs1[HDFS]-->spark[Apache Spark]
spark-->hdfs2[HDFS]
end
kafka2-->druid[Apache Druid]
hdfs2-->druid
lib(anemic domain library) .- kafka-streams
lib .- spark
The real-time data pipeline (a.k.a. the speed layer) consumed an infinite stream of events from a Kafka cluster, performed stateless or stateful transformations, and published the results back to Kafka. The batch data pipeline read complete partitions of data from HDFS, performed transformations and wrote the results to HDFS. Analytical databases like Apache Druid loaded both the real-time and batch results and acted as a serving layer for downstream consumers. To reduce duplication I extracted part of the domain logic into a shared library, which the speed and batch layers used to apply common processing logic.
In practice, such a system had the following design flaws:
- Kafka Streams and Apache Spark used different APIs for stateful operations like windowing and joining. I couldn’t reuse code between the speed and batch layers beyond stateless map/filter operations.
- Runtime environments for the speed and batch layers were different. Kafka Streams ran on a Mesos cluster, Apache Spark on YARN. I couldn’t reuse deployment automation, monitoring and alerting.
- While Apache Spark was a mature data processing framework, Kafka Streams didn’t meet my expectations. See my blog post Kafka Streams DSL vs processor API from 2017 for details.
In 2019 I moved from on-prem Hadoop to Google Cloud Platform and changed the technology stack for developing data pipelines to Apache Beam / Spotify Scio. It’s still a lambda architecture, but the realization is much better than in 2017.
graph LR
subgraph Apache-Beam
direction TB
speed(speed layer).-lib(rich domain library).-batch(batch layer)
end
ps-in[Pubsub]-->Apache-Beam
bq-in[BigQuery]-->Apache-Beam
Apache-Beam-->ps-out[Pubsub]
Apache-Beam-->bq-out[BigQuery]
If I compare the new architecture to the architecture from 2017:
- Apache Beam unifies the domain logic between the batch and speed layers. Stateful operations like window joins or temporal joins are the same for streaming and batch.
- Runtime environments for real-time and batch are almost the same. I deploy all pipelines on Dataflow, a managed service on Google Cloud Platform.
- The maturity of the batch and streaming parts is similar. For example, Dataflow provides external services for data shuffling in both batch and streaming.
However, the unification doesn’t mean that you can deploy exactly the same job in a streaming or a batch manner. There are no magic parameters like --batch or --streaming; you have to build such capability yourself through proper design.
Let’s start with a simple use case where unified batch and streaming delivers real business value: a data pipeline for calculating statistics from the toll booths you encounter on highways, bridges, or tunnels. I took inspiration from the Azure documentation Build an IoT solution by using Stream Analytics.
graph LR
toll-booth-entry[Toll booth entries]-->pipeline[[Unified Data Pipeline]]
toll-booth-exit[Toll booth exits]-->pipeline
vehicle-registration-history[Vehicle registrations]-->pipeline
pipeline-->toll-booth-stat[Toll booth statistics]
pipeline-->total-vehicle-time[Total vehicle times]
pipeline-->vehicles-with-expired-registration[Vehicles with expired registrations]
pipeline-.->diagnostic[Diagnostics]
pipeline-.->dlq[Dead letters]
The streaming pipeline subscribes to events emitted when vehicles cross toll booths. On startup it also reads the history of vehicle registrations, and after starting it gets real-time updates. The command line could look like this:
TollBoothStreamingJob \
--entrySubscription=...
--exitSubscription=...
--vehicleRegistrationTable=...
--vehicleRegistrationSubscription=...
The batch version of the same pipeline reads historical data from a data warehouse and calculates results for a given date specified with the effectiveDate parameter.
TollBoothBatchJob \
--effectiveDate=2014-09-04
--boothEntryTable=...
--boothExitTable=...
--vehicleRegistrationTable=...
The sources of the pipelines are remarkably different. Streaming expects streams of near real-time data to deliver low latency results. Batch requires efficient and cheap access to historical data to process large volumes of data at once.
Regardless of the source of data, batch and streaming pipelines calculate similar statistics, for example toll booth entry stats and total vehicle times.
The streaming job aggregates results in short windows to achieve low latency. It publishes statistics as streams of events for downstream real-time data pipelines. The streaming pipeline also detects vehicles with expired registrations, it’s more valuable than fraud detection in a daily batch.
TollBoothStreamingJob \
--entryStatsTopic=...
--totalVehicleTimeTopic=...
--vehiclesWithExpiredRegistrationTopic=...
The batch job aggregates statistics in much larger windows to achieve better accuracy. It writes results into a data warehouse for downstream batch data pipelines and reporting purposes.
TollBoothBatchJob \
--effectiveDate=2014-09-04
--entryStatsTable=...
--totalVehicleTimeTable=...
As you can see, the sinks of the pipelines are also different. The streaming pipeline publishes low-latency, append-only results as streams of events; the batch pipeline overwrites whole partitions in data warehouse tables for the specified effectiveDate.
Because the streaming pipeline is hard to debug, it’s crucial to put aside some diagnostic information about the current state of the job. For example, the job may receive a toll booth entry message for a given vehicle but not yet have information about this vehicle’s registration. This is a temporary situation when one stream of data (vehicle registrations) is late, and the job produces incomplete results. With proper diagnostics you could decide to change the streaming configuration and increase the allowed lateness for the windowing calculation.
If there is an error in the input data, the batch pipeline just fails. You fix the invalid data and execute the batch job again.
In streaming pipelines, a single invalid record could block the whole pipeline forever. How do you handle such situations?
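One common answer, and the approach used in the streaming pipeline below, is a dead-letter queue: records that fail to decode go to a side output instead of failing the job. A minimal sketch, assuming an Either-based decoder (names are illustrative):
import com.spotify.scio.coders.Coder
import com.spotify.scio.values.SCollection

// Split decode results into valid values and dead letters so a single bad
// record never blocks the pipeline.
def partitionDecoded[T: Coder](
    decoded: SCollection[Either[String, T]]
): (SCollection[T], SCollection[String]) = {
  val valid = decoded.collect { case Right(value) => value }
  val deadLetters = decoded.collect { case Left(error) => error }
  (valid, deadLetters)
}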
As you can see, the streaming and batch pipelines aren’t the same. They have different sources and sinks and use different parameters to achieve either lower latency or better accuracy.
How to organize the code to get unified architecture?
Split the codebase into three layers:
graph TB
application-. depends on .->domain
application-. uses .->infrastructure
A direct dependency between Domain and Infrastructure is forbidden; it would kill code testability.
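One way to enforce this rule is at the build level: the domain module simply never depends on the infrastructure module. A build.sbt sketch with illustrative module names:
// build.sbt sketch (module names are illustrative): domain never depends on
// infrastructure, which keeps the domain logic unit-testable in isolation.
lazy val infrastructure = project

lazy val domain = project // depends only on Apache Beam / Spotify Scio

lazy val application = project
  .dependsOn(domain, infrastructure) // the wiring happens only here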
Below you can find a function for calculating the total time between a vehicle’s entry and exit. The function uses a session window to join vehicle entries and vehicle exits within a gap duration. When there is no exit for a given entry, the function can’t calculate the total time and emits diagnostic information instead.
import org.joda.time.Duration
import com.spotify.scio.values.SCollection
def calculateInSessionWindow(
boothEntries: SCollection[TollBoothEntry],
boothExits: SCollection[TollBoothExit],
gapDuration: Duration
): (SCollection[TotalVehicleTime], SCollection[TotalVehicleTimeDiagnostic]) = {
val boothEntriesById = boothEntries
.keyBy(entry => (entry.id, entry.licensePlate))
.withSessionWindows(gapDuration)
val boothExitsById = boothExits
.keyBy(exit => (exit.id, exit.licensePlate))
.withSessionWindows(gapDuration)
val results = boothEntriesById
.leftOuterJoin(boothExitsById)
.values
.map {
case (boothEntry, Some(boothExit)) =>
Right(toTotalVehicleTime(boothEntry, boothExit))
case (boothEntry, None) =>
Left(TotalVehicleTimeDiagnostic(boothEntry.id, TotalVehicleTimeDiagnostic.MissingTollBoothExit))
}
results.unzip
}
private def toTotalVehicleTime(boothEntry: TollBoothEntry, boothExit: TollBoothExit): TotalVehicleTime = {
val diff = boothExit.exitTime.getMillis - boothEntry.entryTime.getMillis
TotalVehicleTime(
licensePlate = boothEntry.licensePlate,
tollBoothId = boothEntry.id,
entryTime = boothEntry.entryTime,
exitTime = boothExit.exitTime,
duration = Duration.millis(diff)
)
}
The logic is exactly the same for streaming and for batch; there is no I/O-related code here. The streaming pipeline defines a shorter window gap to get lower latency, the batch pipeline a longer gap for better accuracy; this is the only difference.
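For illustration, the only knob that differs between the two callers is the gap duration; the values below match the pipelines shown later in this post:
import org.joda.time.Duration

// Streaming favors latency: close the session after a short gap.
val streamingGap = Duration.standardMinutes(10)
// Batch favors accuracy: tolerate a much longer gap between entry and exit.
val batchGap = Duration.standardHours(1)

// Same domain function, different parameterization:
// TotalVehicleTime.calculateInSessionWindow(entries, exits, gapDuration = streamingGap)
// TotalVehicleTime.calculateInSessionWindow(entries, exits, gapDuration = batchGap)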
Because this is data processing code, don’t expect a pure domain without external dependencies. Domain logic must depend on Apache Beam / Spotify Scio to do something useful.
The application layer is the main differentiator between batch and streaming.
A slightly simplified version of the streaming pipeline might look like this:
def main(mainArgs: Array[String]): Unit = {
val (sc, args) = ContextAndArgs(mainArgs)
val config = TollStreamingJobConfig.parse(args)
// subscribe to toll booth entries and exits as JSON, put invalid messages into DLQ
val (entryMessages, entryMessagesDlq) =
sc.subscribeJsonFromPubsub(config.entrySubscription)
val (exitMessages, exitMessagesDlq) =
sc.subscribeJsonFromPubsub(config.exitSubscription)
// decode JSONs into domain objects
val (entries, entriesDlq) = TollBoothEntry.decodeMessage(entryMessages)
val (exits, exitsDlq) = TollBoothExit.decodeMessage(exitMessages)
// write invalid inputs to Cloud Storage
entriesDlq
.withFixedWindows(duration = TenMinutes)
.writeUnboundedToStorageAsJson(config.entryDlq)
exitsDlq
.withFixedWindows(duration = TenMinutes)
.writeUnboundedToStorageAsJson(config.exitDlq)
// calculate total vehicle times
val (totalVehicleTimes, totalVehicleTimesDiagnostic) =
TotalVehicleTime.calculateInSessionWindow(entries, exits, gapDuration = TenMinutes)
// write aggregated diagnostic to BigQuery
TotalVehicleTimeDiagnostic
.aggregateAndEncode(totalVehicleTimesDiagnostic, windowDuration = TenMinutes)
.writeUnboundedToBigQuery(config.totalVehicleTimeDiagnosticTable)
// encode total vehicle times as a message and publish on Pubsub
TotalVehicleTime
.encodeMessage(totalVehicleTimes)
.publishJsonToPubsub(config.totalVehicleTimeTopic)
// encode total vehicle times and writes into BigQuery, put invalid writes into DLQ
val totalVehicleTimesDlq = TotalVehicleTime
.encodeRecord(totalVehicleTimes)
.writeUnboundedToBigQuery(config.totalVehicleTimeTable)
// union all DLQs as I/O diagnostics, aggregate and write to BigQuery
val ioDiagnostics = IoDiagnostic.union(
entryMessagesDlq,
exitMessagesDlq,
totalVehicleTimesDlq
)
IoDiagnostic
.aggregateAndEncode(ioDiagnostics, windowDuration = TenMinutes)
.writeUnboundedToBigQuery(config.diagnosticTable)
// run the pipeline
sc.run()
}
The corresponding batch pipeline is less complex and looks like this:
def main(mainArgs: Array[String]): Unit = {
val (sc, args) = ContextAndArgs(mainArgs)
val config = TollBatchJobConfig.parse(args)
// read toll booth entries and toll booth exits from BigQuery partition
val entryRecords = sc.readFromBigQuery(
config.entryTable,
StorageReadConfiguration().withRowRestriction(
RowRestriction.TimestampColumnRestriction("entry_time", config.effectiveDate)
)
)
val exitRecords = sc.readFromBigQuery(
config.exitTable,
StorageReadConfiguration().withRowRestriction(
RowRestriction.TimestampColumnRestriction("exit_time", config.effectiveDate)
)
)
// decode BigQuery like objects into domain objects
val entries = TollBoothEntry.decodeRecord(entryRecords)
val exits = TollBoothExit.decodeRecord(exitRecords)
// calculate total vehicle times
val (totalVehicleTimes, totalVehicleTimesDiagnostic) =
TotalVehicleTime.calculateInSessionWindow(entries, exits, gapDuration = OneHour)
// write aggregated diagnostic to BigQuery
TotalVehicleTimeDiagnostic
.aggregateAndEncode(totalVehicleTimesDiagnostic, windowDuration = OneDay)
.writeBoundedToBigQuery(config.totalVehicleTimeDiagnosticOneHourGapTable)
// encode total vehicle times and writes into BigQuery
TotalVehicleTime
.encodeRecord(totalVehicleTimes)
.writeBoundedToBigQuery(config.totalVehicleTimeOneHourGapPartition)
// run the pipeline
sc.run()
}
At first glance the streaming and batch pipelines look like duplicated code which violates the DRY principle (Don’t Repeat Yourself). Where’s the batch and streaming unification? Don’t worry, nothing is wrong with such a design; it promotes clarity and ease of maintenance.
Each pipeline parses its command line arguments into a dedicated configuration class, TollStreamingJobConfig or TollBatchJobConfig. The example application doesn’t use a dependency injection framework. Trust me, you don’t need one to write manageable data pipelines.
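A minimal sketch of what such hand-rolled configuration can look like, using Scio’s Args; the field names mirror the batch job’s command line and the exact shape is illustrative:
import com.spotify.scio.Args
import org.joda.time.LocalDate

// Hand-rolled configuration: a case class plus a parse method, no framework needed.
final case class TollBatchJobConfig(
    effectiveDate: LocalDate,
    entryTable: String,
    exitTable: String,
    vehicleRegistrationTable: String
)

object TollBatchJobConfig {
  def parse(args: Args): TollBatchJobConfig =
    TollBatchJobConfig(
      effectiveDate = LocalDate.parse(args.required("effectiveDate")),
      entryTable = args.required("entryTable"),
      exitTable = args.required("exitTable"),
      vehicleRegistrationTable = args.required("vehicleRegistrationTable")
    )
}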
Where’s the infrastructure in the presented code samples? Because of Scala syntax powered by some implicit conversions it’s hard to spot.
The infrastructure layer provides all the functions listed below in a fully generic way.
- subscribeJsonFromPubsub - for subscribing to JSON messages on Pubsub
- publishJsonToPubsub - for publishing JSON messages on Pubsub
- readFromBigQuery - for reading from BigQuery using the Storage Read API
- queryFromBigQuery - for querying BigQuery using SQL
- writeUnboundedToBigQuery - for writing to BigQuery using the Storage Write API in append mode
- writeBoundedToBigQuery - for writing to BigQuery using File Loads in truncate mode
- writeUnboundedToStorageAsJson - for writing JSON files with dead letters on Cloud Storage

Extract the infrastructure module as a shared library and reuse it in all data pipelines. This is a significant investment in high-quality and robust input/output connectors. Writing and testing I/O is complex, so do it once, and do it well.
Unified batch and streaming processing is a powerful data architecture that seamlessly combines both batch and real-time data processing, enabling organizations to harness real-time insights while efficiently managing large historical data volumes. In the past, these two approaches were treated as separate data pipelines, leading to increased complexity in data infrastructure and codebase management.
However, it’s important to note that unified batch and streaming doesn’t entail running the exact same code for both approaches. Streaming pipelines prioritize lower latency, while batch processing focuses on higher accuracy.
To effectively implement this architecture, consider the following key strategies:
I’ve sourced all the code samples used in this blog post from my dedicated batch and stream processing playground repository, which you can find at https://github.com/mkuthan/stream-processing/. If you come across something intriguing yet unclear in this repository, please don’t hesitate to reach out. I’d be delighted to explore it further and create additional blog content on unified batch and stream processing.
For time series with data points at minute, second or millisecond precision, use time-unit column partitioning. You get convenient querying like:
SELECT ts, temperature, pressure
FROM weather_stations
WHERE
ts BETWEEN TIMESTAMP("2008-12-25 15:30:00")
AND TIMESTAMP("2008-12-25 15:35:00")
With ingestion time partitioning, you have to specify an extra predicate for partition pruning:
SELECT ts, temperature, pressure
FROM weather_stations
WHERE
ts BETWEEN TIMESTAMP("2008-12-25 15:30:00")
AND TIMESTAMP("2008-12-25 15:35:00")
AND _PARTITIONTIME = DATE("2008-12-15")
For aggregates, when you typically don’t need an exact point in time, you can use ingestion time partitioning. You don’t need to specify the time column explicitly in a table schema.
SELECT _PARTITIONTIME AS dt, page_views, unique_visitors
FROM ecommerce_sessions
WHERE
_PARTITIONTIME = DATE("2008-12-15")
You don’t pay for reading the _PARTITIONTIME column. Thanks Krzysiek for pointing this out.
In my projects, the majority of tables require at least 1–3 years of history. With a limit of 4000 partitions per BigQuery table, this requires daily partitioning. Tables with 3 years of retention use 3 * 365 = 1095 daily partitions, which is below the limit. Tables with hourly partitions keep only up to 4000 / 24 = 166 days and 8 hours of data.
For tables with more than 10 years of history I would consider storage other than BigQuery. Export partitions with data older than 10 years to a Google Cloud Storage bucket using a columnar format like Apache Parquet. Create a BigQuery external table on the bucket using the Hive partitioning scheme for querying.
Google Cloud Platform support could raise the limit, for example to 10000 partitions, but don’t expect any guarantees in case of incidents.
If you process data on a daily basis, use daily partitioning for efficient partition pruning. If you process data on an hourly basis and don’t need 6+ months of history in the table, use hourly partitioning.
If you need to keep a longer history, use daily partitioning and one of two tricks for efficient querying: clustering on the timestamp partitioning column, or an extra "hour" clustering column.
For the first trick, clustering on the timestamp partitioning column, the following query reads only 1 minute of data in a daily partitioned table:
SELECT ts, temperature, pressure
FROM weather_stations
WHERE
ts BETWEEN TIMESTAMP("2008-12-25 15:30:00")
AND TIMESTAMP("2008-12-25 15:31:00")
However, the timestamp clustering column has huge entropy, so if you need more clustering columns you can’t use this trick.
For the second trick, with an extra "hour" clustering column, the following query reads one hour of data in a daily partitioned table:
SELECT ts, temperature, pressure
FROM weather_stations
WHERE
ts BETWEEN TIMESTAMP("2008-12-25 15:30:00")
AND TIMESTAMP("2008-12-25 15:31:00")
AND _PARTITIONTIME = DATE("2008-12-15")
AND hour = 15
As you see, such a table isn’t convenient to query; the client must be aware of two extra predicates.
If you need querying data using different timezones, use timestamp column partitioning.
The following query automatically reads data from two daily partitions: 2008-12-24 00:00:00Z and 2008-12-25 00:00:00Z.
SELECT temperature, pressure
FROM weather_stations
WHERE
ts BETWEEN TIMESTAMP("2008-12-25 00:00:00", "CET")
AND TIMESTAMP("2008-12-26 00:00:00", "CET")
For ingestion time partitioning you could load data using a table decorator and use whatever timezone you want instead of UTC. If you load one day of data for the Central European Time (CET) timezone using the ecommerce_sessions$20081215 table decorator, the following query returns correct results:
SELECT DATE(_PARTITIONTIME) AS dt, page_views, unique_visitors
FROM ecommerce_sessions
WHERE
_PARTITIONTIME = DATE("2008-12-15")
Be aware that you can’t query for a range in a timezone other than the one used while loading partitions. Moreover, BigQuery always shows that _PARTITIONTIME uses the UTC timezone, which will be misleading for users.
If you don’t need half-hour or 45-minute timezones, or long retention, you could use tables partitioned hourly by ingestion time using UTC.
The following query reads 24 hourly partitions and returns correct results:
SELECT TIMESTAMP(_PARTITIONTIME, "CET") AS ts, page_views, unique_visitors
FROM ecommerce_sessions
WHERE
_PARTITIONTIME BETWEEN TIMESTAMP("2008-12-25 00:00:00", "CET")
AND TIMESTAMP("2008-12-26 00:00:00", "CET")
If you want to use the Storage Write API for partitioned tables, use column partitioning. The Storage Write API doesn’t support the use of partition decorators to write to a given partition.
If you want to use Streaming Inserts for partitioned tables, use column partitioning. Streaming Inserts have limited support for partition decorators: you can stream to partitions within the last 31 days in the past and 16 days in the future relative to the current date, based on current UTC time.
The Storage Write API and Streaming Inserts write data through the streaming buffer. For ingestion time partitioned tables, data in the streaming buffer is temporarily placed in the __UNPARTITIONED__ partition and has a NULL value in the _PARTITIONTIME column.
This is one more reason not to use ingestion time partitioned tables with the Storage Write API or Streaming Inserts; querying such tables is error prone. I’m not aware of any Batch Loads limitations for partitioned tables.
If you write with the WRITE_TRUNCATE disposition to a column partitioned table, always define the partition decorator. If not, you truncate the whole table without any notice. Thanks Michał for pointing this out.
Below you can find the matrix with cons and pros of different partitioning methods in BigQuery.
| | time-unit column daily | time-unit column hourly | ingestion time daily | ingestion time hourly |
|---|---|---|---|---|
| Best for time-series | ✅ | ✅ | 🚫 | 🚫 |
| Best for aggregates | 🚫 | 🚫 | ✅ | ✅ |
| 6+ months of retention | ✅ | 🚫 | ✅ | 🚫 |
| 10+ years of retention | 🚫 | 🚫 | 🚫 | 🚫 |
| UTC only timezone | ✅ | ✅ | ✅ | ✅ |
| Non-UTC timezone | ✅ | ✅ | limited | limited |
| Many timezones | ✅ | ✅ | 🚫 | limited |
| Storage Write API | ✅ | ✅ | 🚫 | 🚫 |
| Streaming Inserts | ✅ | ✅ | limited | limited |
| Batch Loads | ✅ | ✅ | ✅ | ✅ |
| Partition pruning | natural | natural | less convenient | less convenient |
I would always prefer time-unit column partitioning with daily granularity as the least problematic.
As a team leader I was responsible for a team of 4–5 software engineers. In 2014 Allegro was a relatively small company with a flat organizational structure and startup-like processes. Line management duties weren’t that demanding: I had to accept vacation requests, hire new engineers and help with the promotion process. Most of my time I spent on coaching and mentoring team members, designing architecture, doing code reviews, writing documentation and fighting technical debt. Quite often I developed new features and fixed bugs. I regularly participated in 24/7 on-call shifts and debugged production issues. I felt like a technical leader, not like a team leader or manager.
Team leaders in Allegro have a great technological autonomy and authority to make decisions.
Nowadays Allegro is a large company listed on a stock exchange. The organizational structure is more complex and processes are more formalized. As a leader you have a lot of interactions with other leaders, managers and stakeholders. You’re expected to work from the office; remote work is better suited for individual contributors. In my opinion, the technical leader role is evolving into a regular team leader/manager role. Some leaders I know are still coding and participate in on-call shifts, others are not.
All these changes made me think about my career path.
I decided to go back to a senior engineering role.
I’m in a transition period, but I spend more and more time on coding and less on management duties. I’m still responsible for the team, although I handed over technical leadership of some parts of the platform. The technical leadership of each subsystem is distributed among the most experienced engineers in the team. I’m trying hard not to abuse my authority and not to make final decisions for the parts I don’t own. We’re working together as before to make the platform better and to make sure there is no silo effect.
Everything is going well so far, and I’m happy with my decision.
The staff/principal engineer role seems to be the next step on the technical career path. In my previous company I was in a similar role for a while, but it looked more like a manager role than a software engineering role. In short, I was responsible for the adoption of a new off-the-shelf product. I managed a task force of a few engineers from various teams, and we evaluated the product and its integration with our systems. It wasn’t a technically challenging task; I spent most of my time on Excel sheets and on setting up meetings.
I don’t want to repeat this mistake, so I decided not to apply for a principal engineer role in Allegro for now. I’m going to watch this position for a while and see how it evolves.
Staff/principal engineer role doesn’t have to mean what you initially think.
I’m looking for a team leader who will be able to lead the team better than I did: organize team meetings, do performance reviews, hire new engineers, communicate with other managers, manage budgets and take care of the team’s overall happiness.
The team consists of 5 software engineers; we’re not afraid to make mistakes and we learn from them, we’re A-players.
If you’re interested in the team leader role, please contact me.
Would you describe what type classes are?
Type classes are a concept in functional programming that provide a way to define a set of functions or operations that can be performed on a specific type or set of types.
In a type class, a set of functions or operations are defined, and any type that implements those functions or operations is said to be an instance of that type class. Type classes are used to achieve ad-hoc polymorphism, which means that different functions can be written for different types, but they can all share the same interface or name.
For example, in Haskell, the Eq type class defines the == and /= operators for comparing values of a type. If a type implements these operators, it is an instance of the Eq type class. This allows functions that operate on values of type Eq to be polymorphic and work with any type that is an instance of Eq.
Type classes provide a powerful abstraction mechanism in functional programming and are widely used in languages like Haskell, Scala, and Rust. They allow developers to write generic code that can operate on a wide range of types, while still enforcing constraints on those types to ensure they support the required operations.
Show me an example of type class in Scala.
Sure! Here is an example of a type class in Scala:
trait Printable[A] {
def format(value: A): String
}
object PrintableInstances {
implicit val stringPrintable: Printable[String] =
new Printable[String] {
def format(value: String): String = value
}
implicit val intPrintable: Printable[Int] =
new Printable[Int] {
def format(value: Int): String = value.toString
}
implicit val booleanPrintable: Printable[Boolean] =
new Printable[Boolean] {
def format(value: Boolean): String = if (value) "yes" else "no"
}
}
object Printable {
def format[A](value: A)(implicit printable: Printable[A]): String =
printable.format(value)
def print[A](value: A)(implicit printable: Printable[A]): Unit =
println(format(value))
}
case class Person(name: String, age: Int)
object Person {
implicit val personPrintable: Printable[Person] =
new Printable[Person] {
def format(person: Person): String = s"${person.name} is ${person.age} years old."
}
}
import PrintableInstances._
import Printable._
val message1: String = format("hello world")
val message2: String = format(42)
val message3: String = format(true)
val message4: String = format(Person("Alice", 30))
print("hello world")
print(42)
print(true)
print(Person("Alice", 30))
In this example, Printable is a type class with a single type parameter A. It defines a format method that takes a value of type A and returns a string representation of it. The PrintableInstances object contains implicit instances of the Printable type class for various types, including String, Int, Boolean, and Person. The Printable object provides two methods for working with instances of the Printable type class: format and print. format takes a value of type A and an implicit instance of Printable[A] and returns a string representation of the value. Finally, the example demonstrates how to use the Printable type class and its instances to format and print various types of values.
What are the advantages of that design?
There are several advantages of using type classes and this design pattern:
- Abstraction: Type classes allow for abstraction over data types. This means that functions can be written in a generic way that works for any data type that satisfies the constraints of the type class.
- Reusability: Type classes promote code reuse. Once a type class is defined and instances are created for various data types, the same functionality can be used across different parts of the codebase without duplicating code.
- Flexibility: Type classes allow for ad-hoc polymorphism. This means that different data types can be used interchangeably as long as they implement the required functions or operations.
- Separation of concerns: Type classes separate concerns by defining operations that are relevant to a specific data type. This helps keep code organized and easier to reason about.
- Testability: Type classes make testing easier by allowing for more focused testing of the functionality that is specific to a given type class.
Overall, type classes are a powerful abstraction mechanism that can make code more generic, reusable, flexible, and easier to test.
Ok, show me an example of abstraction.
Sure, here’s an example that demonstrates how type classes can provide abstraction over data types:
trait Summable[A] {
def sum(a: A, b: A): A
}
object SummableInstances {
implicit val intSummable: Summable[Int] =
new Summable[Int] {
def sum(a: Int, b: Int): Int = a + b
}
implicit val stringSummable: Summable[String] =
new Summable[String] {
def sum(a: String, b: String): String = a + b
}
}
object Calculator {
def add[A](a: A, b: A)(implicit summable: Summable[A]): A =
summable.sum(a, b)
}
import SummableInstances._
val sum1: Int = Calculator.add(1, 2)
val sum2: String = Calculator.add("hello", " world")
println(sum1) // prints 3
println(sum2) // prints "hello world"
In this example, Summable is a type class with a single type parameter A. It defines a sum method that takes two values of type A and returns their sum.

The SummableInstances object contains implicit instances of the Summable type class for Int and String.

The Calculator object provides a generic add method that takes two values of type A and an implicit instance of Summable[A] and returns their sum.

Finally, the example demonstrates how the Calculator object can be used to add integers and concatenate strings using the same add method. This demonstrates the abstraction provided by the Summable type class, which allows the same add method to be used with different data types that satisfy the constraints of the type class.
Nice example of abstraction indeed, show me now an example of reusability.
Sure! Here’s an example that shows how type classes can promote code reuse:
trait Show[A] {
def show(a: A): String
}
object ShowInstances {
implicit val intShow: Show[Int] =
new Show[Int] {
def show(a: Int): String = a.toString
}
implicit val stringShow: Show[String] =
new Show[String] {
def show(a: String): String = a
}
}
object Logger {
def log[A](a: A)(implicit show: Show[A]): Unit =
println(show.show(a))
}
import ShowInstances._
val num = 42
val str = "Hello, world!"
Logger.log(num) // prints "42"
Logger.log(str) // prints "Hello, world!"
In this example, Show is a type class with a single type parameter A. It defines a show method that takes a value of type A and returns a string representation of that value.

The ShowInstances object contains implicit instances of the Show type class for Int and String.

The Logger object provides a generic log method that takes a value of type A and an implicit instance of Show[A] and logs the string representation of that value.

Finally, the example demonstrates how the Logger object can be used to log both integers and strings using the same log method. This demonstrates the reusability provided by the Show type class, which allows the same log method to be used with different data types that satisfy the constraints of the type class.
So, why is ad-hoc polymorphism so flexible?
Ad-hoc polymorphism is flexible because it allows for the implementation of the same functionality across different data types without the need for inheritance or a common interface.
In traditional polymorphism, the behavior of a method or function is determined by the runtime type of the object or value being passed in. This means that the behavior is static and cannot be changed without modifying the source code.
Ad-hoc polymorphism, on the other hand, allows the behavior of a method or function to be determined by the implementation of a type class for a particular data type. This means that the behavior can be customized for different data types without changing the source code of the method or function.
This flexibility is particularly useful in situations where different data types have different requirements or semantics for a given operation or behavior. With ad-hoc polymorphism, the behavior can be customized for each data type separately without resorting to complex inheritance hierarchies or interfaces.
Overall, ad-hoc polymorphism provides a powerful mechanism for achieving flexibility and extensibility in code. It allows for a high degree of customization and can help simplify code by removing the need for complex hierarchies or interfaces.
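To make this concrete, here is a small sketch of mine building on the Show type class from the reusability example above (java.time.LocalDate is picked just for illustration, it's not from the original chat): a type we don't own starts working with the existing Logger as soon as we provide an instance for it, without modifying Logger or LocalDate.

import java.time.LocalDate

object ShowFlexibilityExample {
  import ShowInstances._

  // instance for a type we don't own and can't modify
  implicit val localDateShow: Show[LocalDate] =
    new Show[LocalDate] {
      def show(a: LocalDate): String = a.toString
    }

  def demo(): Unit = {
    Logger.log(42)                       // uses the existing Int instance
    Logger.log(LocalDate.of(2023, 2, 1)) // uses the new LocalDate instance, prints "2023-02-01"
  }
}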
Could you show an example of separation of concerns with type classes?
Sure! Here’s an example that shows how type classes can be used to separate concerns in a simple web application:
case class User(id: Int, name: String)
trait UserRepository {
def find(id: Int): Option[User]
}
class InMemoryUserRepository extends UserRepository {
private val users = Map(1 -> User(1, "Alice"), 2 -> User(2, "Bob"))
def find(id: Int): Option[User] = users.get(id)
}
trait UserJsonConverter {
def toJson(user: User): String
}
class UserJsonConverterImpl extends UserJsonConverter {
def toJson(user: User): String = s"""{"id": ${user.id}, "name": "${user.name}"}"""
}
class UserController(userRepository: UserRepository, jsonConverter: UserJsonConverter) {
def findUser(id: Int): String = {
userRepository.find(id) match {
case Some(user) => jsonConverter.toJson(user)
case None => s"""{"error": "User with ID $id not found"}"""
}
}
}
object ExampleApp extends App {
val userRepository = new InMemoryUserRepository()
val jsonConverter = new UserJsonConverterImpl()
val userController = new UserController(userRepository, jsonConverter)
println(userController.findUser(1)) // prints {"id": 1, "name": "Alice"}
println(userController.findUser(3)) // prints {"error": "User with ID 3 not found"}
}
In this example, we have separated the concerns of data storage, JSON serialization, and web application logic using type classes.
The UserRepository trait defines an interface for accessing user data, with a concrete implementation provided by InMemoryUserRepository.

The UserJsonConverter trait defines an interface for converting User objects to JSON strings, with a concrete implementation provided by UserJsonConverterImpl.

The UserController class defines a web application endpoint for finding user data. It takes instances of UserRepository and UserJsonConverter as dependencies and uses them to find the requested user and convert it to a JSON string.

By separating concerns in this way, we can change the implementation of UserRepository or UserJsonConverter without affecting the implementation of UserController. This makes our code more modular and easier to maintain.
Nice, why is it easier to test type classes?
Type classes can be easier to test than traditional inheritance hierarchies or interfaces because they promote a separation of concerns and can reduce coupling between different parts of the code.
With traditional polymorphism, the behavior of a method or function is determined by the runtime type of the object or value being passed in. This can make it difficult to write unit tests for the method or function because you may need to create test objects that implement the same interface as the production code.
With type classes, on the other hand, the behavior of a method or function is determined by the implementation of a type class for a particular data type. This means that you can write unit tests for the type class separately from the production code, using mock objects or test doubles to simulate the different data types that the code will be operating on.
For example, in the web application example I showed earlier, we could write a unit test for UserJsonConverter by creating a mock User object and verifying that the toJson method returns the expected JSON string. We could then write a separate unit test for UserController that verifies that it calls the find method on the UserRepository with the expected user ID and passes the resulting user object to the toJson method of the UserJsonConverter.

Overall, type classes can make it easier to write testable code by promoting a separation of concerns and reducing coupling between different parts of the code.
Could you show an example of a type class test using the Scalatest library?
Sure! Here's an example of how you could use the Scalatest library to test the UserJsonConverter type class from the web application example I showed earlier:
import org.scalatest._
class UserJsonConverterSpec extends FlatSpec with Matchers {
"UserJsonConverter" should "convert a User object to a JSON string" in {
val user = User(1, "Alice")
val converter = new UserJsonConverterImpl()
val expectedJson = """{"id": 1, "name": "Alice"}"""
converter.toJson(user) shouldBe expectedJson
}
}
In this test, we create a User object and a UserJsonConverterImpl object, and then call the toJson method of the converter object with the user object as an argument. We then use the Scalatest shouldBe matcher to verify that the result is equal to the expected JSON string.

This test verifies that the UserJsonConverter type class is able to correctly serialize User objects to JSON strings, without needing to create a mock User object or modify the production code in any way.

You could write similar tests for the UserRepository and UserController components of the web application, using test doubles or mock objects to simulate different data types or dependencies.
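Such a test could look like the sketch below (my own code, not part of the ChatGPT answer): a stub UserRepository replaces the in-memory implementation and the controller is verified against the expected JSON strings.

import org.scalatest._

class UserControllerSpec extends FlatSpec with Matchers {

  // stub repository used instead of InMemoryUserRepository
  private val stubRepository = new UserRepository {
    def find(id: Int): Option[User] =
      if (id == 1) Some(User(1, "Alice")) else None
  }

  private val controller =
    new UserController(stubRepository, new UserJsonConverterImpl())

  "UserController" should "return the user as JSON when it exists" in {
    controller.findUser(1) shouldBe """{"id": 1, "name": "Alice"}"""
  }

  it should "return an error message when the user doesn't exist" in {
    controller.findUser(3) shouldBe """{"error": "User with ID 3 not found"}"""
  }
}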
Hmm, I don't really like the UserJsonConverterImpl implementation, it's so manual. It doesn't scale if my application has hundreds of domain classes. How to derive converters in a more automated way?
You’re right, manually implementing converters for each domain class can become cumbersome and error-prone as the number of classes grows. One way to derive converters in a more automated way is to use a code generation tool, such as ScalaPB or Circe.
ScalaPB is a protocol buffer compiler plugin for Scala that generates case classes and corresponding JSON converters from .proto files. Circe is a JSON library for Scala that provides automatic derivation of JSON codecs (encoders and decoders) for case classes using macros.

Here's an example of how you could use Circe to automatically derive a JSON encoder and decoder for the User case class:
import io.circe._
import io.circe.generic.semiauto._
case class User(id: Int, name: String)
object UserJson {
implicit val userEncoder: Encoder[User] = deriveEncoder[User]
implicit val userDecoder: Decoder[User] = deriveDecoder[User]
}
In this example, we define the User case class and a companion object UserJson that contains two implicit values for encoding and decoding User objects to and from JSON, respectively.

The deriveEncoder and deriveDecoder methods are macros that generate the necessary code for encoding and decoding the User case class, based on its fields and types. This means that we don't need to write any boilerplate code to convert User objects to and from JSON.

To use the UserJson codecs, we can simply import them into other parts of the code:
import UserJson._
val user = User(1, "Alice")
val json = userEncoder(user).noSpaces
val decodedUser = decode[User](json)
This example shows how type classes and code generation tools can work together to provide flexible and automated serialization and deserialization of domain objects.
How to test that a derived encoder is able to encode different values for the id and name fields of the User case class?

One way to test that the derived encoder is able to handle a variety of values for the id and name fields of the User case class is to write a test that generates random User objects and encodes them to JSON, then decodes the JSON and checks that the resulting object is equal to the original.

Here's an example of how you could write such a test using the Scalacheck library:
import org.scalatest._
import org.scalatestplus.scalacheck.ScalaCheckPropertyChecks
import io.circe.syntax._
import io.circe.parser._
class UserJsonSpec extends FlatSpec with Matchers with ScalaCheckPropertyChecks {
"UserJson" should "encode and decode random User objects" in {
forAll { (id: Int, name: String) =>
val user = User(id, name)
val json = user.asJson.noSpaces
val decodedUser = decode[User](json)
decodedUser shouldBe Right(user)
}
}
}
In this test, we use the Scalacheck forAll method to generate random Int and String values for the id and name fields of the User case class, respectively. We then create a User object with those values, encode it to JSON using the asJson method from Circe, decode the resulting JSON using the decode method from Circe, and finally check that the decoded object is equal to the original object using the Scalatest shouldBe matcher.

This test verifies that the derived encoder and decoder are able to correctly handle a wide range of values for the id and name fields of the User case class, without needing to manually specify test cases for each possible value.
Thank you for the nice evening.
You’re welcome! It was a pleasure helping you. Have a good evening!
User case class, let me disagree with that.

I'm going to repeat such chats occasionally, and warmly encourage you to try it also. Let me know if you find any mistakes in ChatGPT responses 😀
Technical debt tends to increase naturally over time. The complexity grows, architecture doesn't match the new requirements, developers leave and come, priorities change and more. I've observed the following enterprise software lifecycle:
The oldest services and data pipelines your team supports might work smoothly without any intervention. The last time you deployed them was a quarter ago; you didn't even check them out from the repository to your newly installed laptop. You almost forget about these services, they're beyond the horizon.
Over time your small team has 20+ services and hundreds of data pipelines to support, but only a few of them are under active development. Everything seems to be fine until someone asks you to make an important and urgent change in one of the oldest services you support. What could go wrong?
No one wants to make changes in haunted graveyards with high risk of spectacular failure
I'm primarily responsible for the Allegro clickstream ingestion platform: a highly scalable and fault-tolerant set of services and data pipelines that process billions of events every day. The platform is an important part of the Allegro ecosystem; it delivers data for modeling recommendations, tuning the search index, calculating ads metrics, building client profiles and more.
The oldest parts of the platform are 7–8 years old, stability is outstanding, everything just works. Because a data model is like a generic map, change requests are rare, data producers are able to change the payload without any modification in the data pipelines. Do you see a scratch on the glass?
Last year I read the Software Engineering at Google book and found an interesting argument:
You may choose to not change things, but you need to be capable
Am I capable of making the change in finite time, with the expected quality, and without risk? A quick analysis proved that components on the critical path were outdated and deployment intervals were longer than a quarter. For example, one of the services used the following versions of software:
There were no end-to-end tests and deployment was a semi-manual process. After merging features to the main branch, a GitHub action released the components and put them into the artifact registry. Developers had to deploy released components using the Allegro management GUI console on the DEV, TEST, Canary and finally the PROD environment. Not a big deal if you deploy components quarterly.
I convinced my engineering manager to update critical components in the clickstream ingestion platform. We were going to spend precious resources not on delivering new features, but on being capable of doing it 😀
I was also thinking about how to prevent backsliding in the future with minimal effort. If components rusted once, they will rust again and again.
What could be a continuous reason for modifications if there are no business change requests?
Dependency updates managed by Dependabot is a way to enforce updates for the projects in maintenance phase
Enabling automated dependency updates is a 5-minute task for projects hosted on GitHub, but preparing projects for such updates could take weeks. You have to set up the projects properly; if not, automated dependency updates will kill your productivity. You have to be able to apply changes and deploy often, without any fear.
For the clickstream ingestion platform I created the list of minimal prerequisites to enable Dependabot:
After the preparation step you can plan updates, one by one with a full deployment cycle every time. Improve and expand automated tests if you observe any regression.
We applied the following war plan for a battle with technical debt:
Almost every Monday we get new pull requests with dependency updates. If there is a regular story scheduled for a sprint, all waiting dependency updates become part of that task. For projects with Dependabot enabled, up-to-date dependencies are an implicit part of the definition of done for any user story.
If the project isn’t under active development, Dependabot pull requests patiently wait in a queue until one of the software engineers has some spare time. Because everything is automated, such an update doesn’t take more than half an hour. You have to break the old habit that after every deployment you should look into logs, check the metrics and do some manual smoke tests. No – review, merge, and forget.
If a dependency update build fails, software engineers create an issue to analyze the problem. It's always good to know that one of your dependencies isn't compatible any more. If the team can't afford the change in the near future, software engineers document the technical debt as an ignore clause in the Dependabot configuration file. For example, we can't use the latest Akka version due to the changed licensing policy.
updates:
- package-ecosystem: "gradle"
directory: "/"
schedule:
interval: "weekly"
ignore:
# Akka versions >= 2.7.x have changed licensing
- dependency-name: "com.typesafe.akka:akka-*"
versions: [ "[2.7.0,)" ]
If your corporate policy permits deployment without any acceptance, you could enable automatic pull request merges and deployments whenever the build is green. At Allegro at least one software engineer must accept the change, so I couldn't automate the process fully.

For services without risk of data loss, you should enable automated deployment on the PROD environment as well. For example, if there is an incident for Kafka-based data pipelines, deploy the job again and replay events from the Kafka cluster.
3.3.0-SNAP3 in 2020, but the latest version is 3.2.15.

I would strongly recommend fighting technical debt using Dependabot, at least for the services on the critical path if they're in the maintenance phase.
Do you remember?
You may choose to not change things, but you need to be capable
Today, I would like to share my experiences on how to choose appropriate tools for writing better documentation.
Documentation as code is a key enabler for all tools and practices presented in this blog post. It enables a culture where developers take ownership of documentation. With a documentation-as-code culture, developers:
I used to write documentation using the Atlassian Confluence wiki platform. Regardless of the effort, the documentation was always rusty.
Use whatever plain text markup you like: Markdown, reStructuredText, AsciiDoc, DocBook or LaTeX. Check only, if the format meets the following criteria:
I used to write documentation using LaTeX and DocBook, although for the last 10 years I've been using Markdown.
Markdown is perfect for a single README, but how do you build documentation that consists of many pages? Some parser must convert a bunch of Markdown files into a beautiful website.
Again, use whatever converter you like but check if the following capabilities exist for generated website:
Because you are writing technical documentation look also for:
At the time of writing my favorite Markdown converter is Jekyll but I would also like to check Hugo.
With Jekyll you can serve your website locally using the following command:
$ jekyll serve --livereload
Configuration file: /Users/marcin/tmp/site/my-new-site/_config.yml
Source: /Users/marcin/tmp/site/my-new-site
Destination: /Users/marcin/tmp/site/my-new-site/_site
Incremental build: disabled. Enable with --incremental
Generating...
Jekyll Feed: Generating feed for posts
done in 0.404 seconds.
Auto-regeneration: enabled for '/Users/marcin/tmp/site/my-new-site'
LiveReload address: http://127.0.0.1:35729
Server address: http://127.0.0.1:4000/
Server running... press ctrl-c to stop.
Every edit regenerates pages for the modified files and reloads the web browser. It's a convenient way of writing documentation: you continuously observe the results of your edits.
For my personal blog I use minimal-mistakes theme. Advertised as “perfect for hosting your personal site, blog, or portfolio.” Look at https://github.com/mkuthan/mkuthan.github.io repository if you want to know, how to get minimal but fully functional blog posts website like this one 😀
For technical documentation I prefer the just-the-docs theme. Below you can see the documentation site of my team (please note that we're in the middle of a migration from the wiki and the site isn't finished yet):
I couldn't open-source the repository, but I will be glad to share the essential configuration and customizations.
Gemfile with Ruby dependencies for local run:
source 'https://artifactory.allegrogroup.com/artifactory/rubygems.org'
gem 'github-pages', '~> 227', group: :jekyll_plugins
Jekyll configuration file _config.yml:
title: "Foobar team documentation"
url: https://foobar-documentation.gh.allegrogroup.com
logo: /assets/images/foobar-logo.png
remote_theme: just-the-docs/just-the-docs
# color configuration for {: .note } and {: .important }
callouts:
note:
color: blue
important:
color: red
# enable mermaid graphs
mermaid:
version: "9.1.7"
# enable footer link for easy navigation to the top of the page
back_to_top: true
back_to_top_text: "Back to top"
# enable footer link to the GitHub editor for the page
gh_edit_link: true
gh_edit_repository: https://github.com/allegro-internal/foobar-documentation
gh_edit_link_text: "Edit this page on GitHub"
gh_edit_branch: "master"
gh_edit_view_mode: "edit"
# header links
aux_links:
"Quick link":
- "https://c.qxlint/foobar"
"GitHub repository":
- "https://github.com/allegro-internal/foobar-documentation"
aux_links_new_tab: true
Customized _layouts/page.html to automatically generate a table of contents on every page. See the jekyll-toc repository for the _includes/toc.html file.
---
layout: default
---
<!-- generate TOC on every page, see _includes/toc.html -->
{% include toc.html html=content %}
{{ content }}
That's all folks, Jekyll and the "Just the Docs" theme do all the hard work and you can focus on writing the documentation.
The ultimate goal of the automation is to deploy documentation on every commit or merge to the main branch of the repository. GitHub provides GitHub Pages, a hosting service that takes HTML, CSS and JavaScript files from a repository, runs the files through a build process, and publishes a website.
If you use only allowed Jekyll plugins and remote theme, GitHub publishes the site automatically. You only have to configure the “Pages” section in the repository settings.
If you are using plugins not supported by GitHub Pages, you have to build the website using GitHub actions. Below you can see the action I configured for my blog:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
name: Publish
on:
push:
branches:
- master
jobs:
jekyll:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: ruby/setup-ruby@v1
with:
ruby-version: "2.7"
bundler-cache: true
- run: bundle exec jekyll build
- uses: peaceiris/actions-gh-pages@v3
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
publish_branch: gh-pages
publish_dir: ./_site
The workflow publishes the generated _site directory (line 25) to the gh-pages branch (line 24); configure the repository to watch for the documentation in that branch instead of master or main.

According to the KISS principle, I would prefer automated publication from GitHub Pages over a custom action.
To keep Ruby gems and GitHub actions up-to-date, configure Dependabot in the .github/dependabot.yml file:
version: 2
updates:
- package-ecosystem: "github-actions"
directory: "/"
schedule:
interval: "weekly"
- package-ecosystem: "bundler"
directory: "/"
schedule:
interval: "weekly"
You are a developer, not a technical writer, so any automated writing help is more than welcome. Recently I found vale.sh, an excellent framework for creating and enforcing custom writing rules. Fortunately, you don't have to develop the rules from scratch; vale.sh provides ready-to-use styles, for example:
On my blog I use “Microsoft writing style guide” and “Proselint” writing styles together. Below you can find a few code review comments for this blog post generated by vale.sh:
Flesch–Kincaid grade level rule: the number of years of education generally required to understand the text 😂
If a series has more than three items or the items are long, consider a bulleted list to improve readability, explanation.
Use singular first-person pronouns sparingly, explanation.
Fix all typos or define your own vocabulary.
Check all styles provided as vale.sh packages and choose the one or two that best suit your writing needs. Using many styles together gives you redundant alerts; more isn't always better.
Define the .vale.ini configuration file:
1
2
3
4
5
6
StylesPath = .github/vale
Vocab = Blog
Packages = Microsoft, proselint
[*.md]
BasedOnStyles = Vale, Local, Microsoft, proselint
Styles are stored in the .github/vale directory (line 1) and the custom vocabulary in the .github/vale/Vocab/Blog directory (line 2).

Define custom vocabulary in .github/vale/Vocab/Blog/accept.txt and .github/vale/Vocab/Blog/reject.txt files.
See reference documentation for more details about the syntax.
Define custom rules in the .github/vale/Local directory; for example, the Flesch–Kincaid grade level rule looks as follows:
extends: metric
message: "Try to keep the Flesch–Kincaid grade level (%s) below 10."
link: https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests
formula: |
(0.39 * (words / sentences)) + (11.8 * (syllables / words)) - 15.59
condition: "> 10"
Vale compiles into a fast, dependency-free binary for macOS, Windows, and Linux. Install it, for example with brew install vale, and download the configured packages with vale sync (don't commit the installed packages!). Then check a single document with vale document.md, or all documents in a directory with vale ./directory. You should get results like this:
$ vale _posts/2022-10-24-technical-writing.md
_posts/2022-10-24-technical-writing.md
33:70 suggestion Use the Oxford comma in Microsoft.OxfordComma
'AsciiDoc, DocBook or LaTeX.'.
65:81 suggestion Did you really mean 'yuml'? Vale.Spelling
89:1 error Remove 'Very'. proselint.Very
91:5 suggestion '"Minimal Mistakes" theme' Microsoft.Headings
should use sentence-style
capitalization.
98:5 suggestion '"Just the Docs" theme' Microsoft.Headings
should use sentence-style
capitalization.
101:67 warning Try to avoid using Microsoft.We
first-person plural like 'we'.
181:5 suggestion 'GitHub actions' should use Microsoft.Headings
sentence-style capitalization.
216:15 suggestion 'KISS' has no definition. Microsoft.Acronyms
246:4 suggestion Did you really mean Vale.Spelling
'Proselint'?
248:55 suggestion Did you really mean Vale.Spelling
'Proselint'?
252:8 suggestion Did you really mean 'Kincaid'? Vale.Spelling
282:28 suggestion Did you really mean Vale.Spelling
'proselint'?
288:74 suggestion Did you really mean 'Kincaid'? Vale.Spelling
309:5 suggestion 'GitHub action' should use Microsoft.Headings
sentence-style capitalization.
335:3 error 'TODO' left in text. proselint.Annotations
335:41 suggestion Did you really mean 'jekyll'? Vale.Spelling
✖ 2 errors, 1 warning and 13 suggestions in 1 file.
As you see, a custom vocabulary definition is a must. The Vale.Spelling rule kindly reminds you with the "Did you really mean" question every time you use an unknown word.

If you want to get pull request comments from vale.sh, define a GitHub action using vale-action. The action installs vale.sh packages, checks documents under the _posts directory and adds comments to the pull request.
name: Lint
on: [ pull_request ]
jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: errata-ai/vale-action@v2
with:
files: _posts/
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
Pull request comments are useful only if they're context aware. The vale-action puts comments only on the modified files, thanks to the reviewdog framework.
With plain text markup like Markdown you can edit documents in any text editor. Why not make your life easier with a more sophisticated tool? If you have never used Visual Studio Code give it a try.
I fell in love with Visual Studio Code because:
Why did I choose to learn Go?
After a quick look at Reddit I realized that Go is a controversial language: What do you like about Go?, Why do people hate Go?. But one opinion comes up frequently:
If you want to adopt Go, you shouldn’t try to blindly translate patterns from other languages.
So, I wanted to learn how to write idiomatic code in Go and understand the reasons for the design decisions. I started my journey with the book "Learning Go" written by Jon Bodner:
The book is well written, shows many examples, and explains idiomatic Go design patterns. It covers Go version 1.15, with a chapter about generics introduced in version 1.18. The current Go version is 1.19 as of this writing, but I didn't find any groundbreaking new features in the latest releases.
The most fascinating part of the book for me was chapter 10, "Concurrency in Go". The concurrency model is based on the Communicating Sequential Processes paper from 1978. If you want to see how Go implements (more or less) this concurrency model, watch the session from the Gopher conference presented by Kavya Joshi.
After this short video I had more unanswered questions than before, and I reached for another book: “Concurrency in Go” written by Katherine Cox-Buday 😀
Finally, I started writing an extensive summary of the Go language: what I really like, what I don't like, and where I haven't formed an opinion yet. After a week or so, I figured out that "my summary" had already been written by Sylvain Valez in the article Go: the Good, the Bad and the Ugly. The article covers all the personal notes I took while reading the books, and more. And I fully agree with Sylvain, really not a single point for dispute.
My final, personal verdict about Go language is …
I don't buy it. I can't accept a language with mutable basic data structures and no support for the functional programming paradigm. Not in 2022.
Did I waste my time learning Go? Definitely not, and I would recommend learning Go if you are a Java, Scala, Python or JavaScript programmer. Even if you are using a more sophisticated language on a daily basis, learning Go teaches you that developing "too smart" code doesn't make you a better software engineer 😜
In short, the book greatly enhanced my software engineering comprehension and gave me motivation for further experimentation. The book wasn't easy to digest: many authors, different writing styles, repetitions across chapters, and over 600 pages in the paperback version. But I don't regret even a minute of reading. I rated the book with 5 stars on Goodreads. Until now only 17 of the 80 technical books I read have got top scores 😀.
Below you can find a few personal notes taken when I was reading the book.
Ubiquitous law, referenced many times in the book.
With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.
For example, Linux wraps process identifiers (PID) when they exceed 32768 (2^15). Even if you can set a higher value, it would break things, because many libraries assume that the maximum is 32768.
Sustainable software
You may choose to not change things, but you need to be capable.
Blaze
See also the open source version of Blaze - Bazel.
Rules help us to manage complexity and build a maintainable codebase.
See also: google-java-format, Yapf, ClangFormat and Gofmt.
The code review process is the primary developer workflow upon which almost all other processes must hang, from testing to static analysis to CI.
Testing
It has enabled us to build larger systems with larger teams, faster than we ever thought possible.
Documentation
Software engineers need to realize that producing quality documentation is part of their job and saves them time and effort in the long run.
Deprecation strategy
Scalably maintaining complex software systems over time is more than just building and running software: we must also be able to remove systems that are obsolete or otherwise unused.
Global code search with full understanding of the structure and references, greatly simplifies development.
How have others done something with this API? Where does a specific piece of information exist in the codebase? What is a specific part of the codebase doing? Why was a certain piece of code added, or why does it behave in a certain way?
See also a primary building block of the code search: kythe.io
Large Scale Changes (LSC)
No matter the size of your organization, it’s reasonable to think about how you would make these kinds of sweeping changes across your collection of source code. Whether by choice or by necessity, having this ability will allow greater flexibility as your organization scales while keeping your source code malleable over time.
See Sourcegraph.
Goal / Signals / Metrics (GSM)
Engineering manager vs. Tech lead vs. Tech lead manager
Engineering manager:
Responsible for the performance, productivity, and happiness of every person on their team including their tech lead while still making sure that the needs of the business are met by the product for which they are responsible
Tech lead:
Responsible for the technical aspects of the product, including technology decisions and choices, architecture, priorities, velocity, and general project management (although on larger teams they might have program managers helping out with this). Also an individual contributor.
Tech lead manager:
A single person who can handle both the people and technical needs of their team. At Google, it’s customary for larger, well-established teams to have a pair of leaders—one TL and one engineering manager—working together as partners. The theory is that it’s really difficult to do both jobs at the same time (well) without completely burning out, so it’s better to have two specialists crushing each role with dedicated focus.
Tips and tricks
Tools and libraries
In this blog post you will learn:
The diagram below presents the streaming data pipeline for an IoT-like system:
flowchart LR
Sensor1 --> A1[Unreliable Network] --> A2[/Pubsub topic 1/] --> A3[Tumble Window]
Sensor2 --> B1[Unreliable Network] --> B2[/Pubsub topic 2/] --> B3[Tumble Window]
subgraph Dataflow SQL
A3[Tumble Window] --> C1(Left Outer Join)
B3[Tumble Window] --> C1(Left Outer Join)
end
C1(Left Outer Join) --> C2[(BigQuery Table)]
There are two sensors that send their readings periodically over an unreliable network to Pubsub topics. The payload contains ts with a sensor event timestamp and val with a sensor reading:
{'ts': '2022-08-07T10:28:28+00:00', 'val': 0.69}
{'ts': '2022-08-07T10:28:29+00:00', 'val': 0.67}
{'ts': '2022-08-07T10:28:30+00:00', 'val': 0.89}
Next, there is a Dataflow SQL job which subscribes to the topics and calculates the averages of the sensor readings in short, non-overlapping fixed windows (tumble windows). The averages from both sensors are joined and the job calculates the difference between them. The readings from the right stream are optional, so if there is no data from the second sensor the calculation is done with data from the first sensor only. Finally, the results of the computation are written continuously to a BigQuery table for further analysis.
The query:
SELECT
CURRENT_TIMESTAMP, s1.window_start, s1.window_end,
s1.samples + COALESCE(s2.samples, 0),
s1.value_avg - COALESCE(s2.value_avg, 0.0)
FROM (
SELECT
TUMBLE_START(event_timestamp, INTERVAL '5' SECOND) AS window_start,
TUMBLE_END(event_timestamp, INTERVAL '5' SECOND) AS window_end,
AVG(stream1.payload.val) AS value_avg,
COUNT(*) AS samples
FROM stream1
GROUP BY TUMBLE(event_timestamp, INTERVAL '5' SECOND)) s1
LEFT OUTER JOIN (
SELECT
TUMBLE_START(event_timestamp, INTERVAL '5' SECOND) AS window_start,
TUMBLE_END(event_timestamp, INTERVAL '5' SECOND) AS window_end,
AVG(stream2.payload.val) AS value_avg,
COUNT(*) AS samples
FROM stream2
GROUP BY TUMBLE(event_timestamp, INTERVAL '5' SECOND)) s2
ON s1.window_start = s2.window_start AND s1.window_end = s2.window_end
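For comparison only, here is a rough Scio sketch of the same logic (types and names are my own assumptions; this code isn't part of the SQL job): average the readings per 5-second fixed window and left outer join both streams on the window start.

import com.spotify.scio.values.SCollection
import org.apache.beam.sdk.transforms.windowing.IntervalWindow
import org.joda.time.{Duration, Instant}

object SensorDiffSketch {
  final case class Reading(ts: Instant, value: Double)

  private def windowedAvg(readings: SCollection[Reading]): SCollection[(Instant, (Long, Double))] =
    readings
      .timestampBy(_.ts)
      .withFixedWindows(Duration.standardSeconds(5))
      .withWindow[IntervalWindow]
      .map { case (reading, window) => (window.start(), (1L, reading.value)) }
      .reduceByKey { case ((n1, sum1), (n2, sum2)) => (n1 + n2, sum1 + sum2) } // per-window count and sum
      .mapValues { case (n, sum) => (n, sum / n) } // keep the count, turn the sum into an average

  def diff(
      stream1: SCollection[Reading],
      stream2: SCollection[Reading]
  ): SCollection[(Instant, Long, Double)] =
    windowedAvg(stream1).leftOuterJoin(windowedAvg(stream2)).map {
      case (windowStart, ((samples1, avg1), s2)) =>
        val samples = samples1 + s2.map(_._1).getOrElse(0L)
        val avgDiff = avg1 - s2.map(_._2).getOrElse(0.0)
        (windowStart, samples, avgDiff)
    }
}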
Create Pubsub topics for streams of sensor data:
$ gcloud pubsub topics create marcin-atm22-stream-1
Created topic [projects/sc-9366-nga-dev/topics/marcin-atm22-stream-1].
$ gcloud pubsub topics create marcin-atm22-stream-2
Created topic [projects/sc-9366-nga-dev/topics/marcin-atm22-stream-2].
Starting with version 2.6.0, Beam SQL includes an interactive shell, called the Beam SQL shell. The shell supports writing pipelines as SQL queries and executing them on the Direct runner. You can also specify a different runner, for example Dataflow, and execute the pipeline on Google Cloud Platform.
The latest working version of Apache Beam SQL shell is 2.38. See https://github.com/apache/beam/issues/22615 for more details.
To use Beam SQL shell you must build the binary from sources. Clone Apache Beam repository and execute the following command to build the SQL shell with support for Google IO connectors and Dataflow runner.
$ ./gradlew -p sdks/java/extensions/sql/shell \
-Pbeam.sql.shell.bundled=':sdks:java:io:google-cloud-platform,:runners:google-cloud-dataflow-java' \
installDist
(...)
BUILD SUCCESSFUL in 22s
88 actionable tasks: 43 executed, 35 from cache, 10 up-to-date
Run Apache Beam SQL shell and check for registered tables:
$ ./sdks/java/extensions/sql/shell/build/install/shell/bin/shell
Welcome to Beam SQL 2.38.0-SNAPSHOT (based on sqlline version 1.4.0)
0: BeamSQL> !tables
+-----------+-------------+------------+--------------+---------+----------+------------+
| TABLE_CAT | TABLE_SCHEM | TABLE_NAME | TABLE_TYPE | REMARKS | TYPE_CAT | TYPE_SCHEM |
+-----------+-------------+------------+--------------+---------+----------+------------+
| | metadata | COLUMNS | SYSTEM TABLE | | | |
| | metadata | TABLES | SYSTEM TABLE | | | |
+-----------+-------------+------------+----------- -+---------+----------+------------+
0: BeamSQL> !quit
Closing: org.apache.beam.sdk.extensions.sql.impl.JdbcConnection
With the Beam SQL extensions you can map Pubsub topics and BigQuery tables as external tables. This is required to use any external storage system in SQL queries.
Create external tables for Pubsub topics:
0: BeamSQL> CREATE EXTERNAL TABLE stream1 (
. . . . . > event_timestamp TIMESTAMP, attributes MAP<VARCHAR, VARCHAR>,
. . . . . > payload ROW<ts TIMESTAMP, val DOUBLE>)
. . . . . > TYPE pubsub
. . . . . > LOCATION 'projects/sc-9366-nga-dev/topics/marcin-atm22-stream-1'
. . . . . > TBLPROPERTIES '{"format": "json", "timestampAttributeKey":"ts"}';
No rows affected (0.819 seconds)
0: BeamSQL> CREATE EXTERNAL TABLE stream2 (
. . . . . > event_timestamp TIMESTAMP, attributes MAP<VARCHAR, VARCHAR>,
. . . . . > payload ROW<ts TIMESTAMP, val DOUBLE>)
. . . . . > TYPE pubsub
. . . . . > LOCATION 'projects/sc-9366-nga-dev/topics/marcin-atm22-stream-2'
. . . . . > TBLPROPERTIES '{"format": "json", "timestampAttributeKey":"ts"}';
No rows affected (0.019 seconds)
Pay special attention to the timestampAttributeKey property, the Pubsub attribute used for tracking event time.

Create an external BigQuery table for the results:
0: BeamSQL> CREATE EXTERNAL TABLE results (
. . . . . > created_at TIMESTAMP, window_start TIMESTAMP, window_end TIMESTAMP,
. . . . . > samples INT, avg_diff DOUBLE)
. . . . . > TYPE bigquery
. . . . . > LOCATION 'sc-9366-nga-dev:marcin_atm22.results'
. . . . . > TBLPROPERTIES '{"writeDisposition": "WRITE_APPEND"}';
No rows affected (0.013 seconds)
Thanks to the WRITE_APPEND disposition for the BigQuery writer, the pipeline deploys and runs even if the table already exists.

Check the table schema:
0: BeamSQL> !describe results
+-----------+-------------+------------+--------------+-----------+--------------+
| TABLE_CAT | TABLE_SCHEM | TABLE_NAME | COLUMN_NAME | DATA_TYPE | TYPE_NAME |
+-----------+-------------+------------+--------------+-----------+--------------+
| | beam | results | created_at | 93 | TIMESTAMP(6) |
| | beam | results | window_start | 93 | TIMESTAMP(6) |
| | beam | results | window_end | 93 | TIMESTAMP(6) |
| | beam | results | samples | 4 | INTEGER |
| | beam | results | avg_diff | 8 | DOUBLE |
+-----------+-------------+------------+--------------+-----------+--------------+
Try to query the table. You will get an error because the table doesn't exist yet; the job will create it just before the first insert.
0: BeamSQL> SELECT COUNT(*) from RESULTS;
Error: Error while executing SQL "SELECT COUNT(*) from RESULTS": com.google.api.gax.rpc.NotFoundException: io.grpc.StatusRuntimeException: NOT_FOUND:
Not found: Table sc-9366-nga-dev:marcin_atm22.results (state=,code=0)
In theory, you should be able to run the SQL pipeline on the Direct runner. But in practice the following windowed query on a single stream takes ~15 minutes to finish. I don't know why the Direct runner is so slow for such a tiny amount of data; I would expect decent performance on a MacBook M1 Pro.
SELECT
TUMBLE_START(event_timestamp, INTERVAL '5' SECOND) AS window_start,
TUMBLE_END(event_timestamp, INTERVAL '5' SECOND) AS window_end,
AVG(stream1.payload.val) AS value_avg,
COUNT(*) AS samples
FROM stream1
GROUP BY TUMBLE(event_timestamp, INTERVAL '5' SECOND)
LIMIT 5;
Pay special attention to the LIMIT clause. The Direct runner collects results in the global window, and you have to specify a non-default trigger to get any result. The LIMIT clause is translated into a data-driven trigger that emits the results after 5 elements.
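In Scio/Beam terms, the trigger set up by the shell corresponds roughly to the sketch below (my own illustration, not the shell's actual code): a global window that fires once at least 5 elements have arrived.

import com.spotify.scio.values.{SCollection, WindowOptions}
import org.apache.beam.sdk.transforms.windowing.AfterPane
import org.apache.beam.sdk.values.WindowingStrategy.AccumulationMode
import org.joda.time.Duration

object TriggerSketch {
  def firstFive[T](elements: SCollection[T]): SCollection[T] =
    elements.withGlobalWindow(
      WindowOptions(
        trigger = AfterPane.elementCountAtLeast(5), // data-driven trigger, fires after 5 elements
        accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES,
        allowedLateness = Duration.ZERO
      )
    )
}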
Change the default Direct runner to the Dataflow runner and set a few mandatory pipeline options:
0: BeamSQL> SET runner = DataflowRunner;
No rows affected (0.017 seconds)
0: BeamSQL> set region = 'europe-west1';
No rows affected (0.004 seconds)
0: BeamSQL> set gcpTempLocation = 'gs://sc-9366-nga-dev-marcin/tmp';
No rows affected (0.015 seconds)
Execute final query and wait a few minutes until Dataflow job creates subscriptions and connects to Pubsub topics:
0: BeamSQL> INSERT INTO results
. . . . . > SELECT
. . . . . > CURRENT_TIMESTAMP, s1.window_start, s1.window_end,
. . . . . > s1.samples + COALESCE(s2.samples, 0),
. . . . . > s1.value_avg - COALESCE(s2.value_avg, 0.0)
. . . . . > FROM (
. . . . . > SELECT
. . . . . > TUMBLE_START(event_timestamp, INTERVAL '5' SECOND) AS window_start,
. . . . . > TUMBLE_END(event_timestamp, INTERVAL '5' SECOND) AS window_end,
. . . . . > AVG(stream1.payload.val) AS value_avg,
. . . . . > COUNT(*) AS samples
. . . . . > FROM stream1
. . . . . > GROUP BY TUMBLE(event_timestamp, INTERVAL '5' SECOND)) s1
. . . . . > LEFT OUTER JOIN (
. . . . . > SELECT
. . . . . > TUMBLE_START(event_timestamp, INTERVAL '5' SECOND) AS window_start,
. . . . . > TUMBLE_END(event_timestamp, INTERVAL '5' SECOND) AS window_end,
. . . . . > AVG(stream2.payload.val) AS value_avg,
. . . . . > COUNT(*) AS samples
. . . . . > FROM stream2
. . . . . > GROUP BY TUMBLE(event_timestamp, INTERVAL '5' SECOND)) s2
. . . . . > ON s1.window_start = s2.window_start AND s1.window_end = s2.window_end;
(...)
Executing pipeline on the Dataflow Service, which will have billing implications related to Google Compute Engine usage and other Google Cloud Services.
To access the Dataflow monitoring console, please navigate to https://console.cloud.google.com/dataflow/jobs/europe-west1/2022-08-08_11_52_01-1894085106596231231?project=sc-9366-nga-dev
(...)
INFO: To cancel the job using the 'gcloud' tool, run:
> gcloud dataflow jobs --project=sc-9366-nga-dev cancel --region=europe-west1 2022-08-07_04_04_14-1079245398928473428
No rows affected (26.455 seconds)
Beam SQL shell is convenient for experimentation, but for production deployment I would prefer the gcloud Dataflow SQL query command or the DataflowStartSqlJobOperator Apache Airflow operator.
$ gcloud dataflow sql query \
--job-name=marcin-atm22-gcloud \
--region=europe-west1 \
--bigquery-project=sc-9366-nga-dev \
--bigquery-dataset=marcin_atm22 \
--bigquery-table=gcloud-results \
--bigquery-write-disposition=write-append \
"
CREATE EXTERNAL TABLE stream1 (
event_timestamp TIMESTAMP, attributes MAP<VARCHAR, VARCHAR>,
payload ROW<ts TIMESTAMP, val DOUBLE>)
TYPE pubsub
LOCATION 'projects/sc-9366-nga-dev/topics/marcin-atm22-stream-1'
TBLPROPERTIES '{\"format\": \"json\", \"timestampAttributeKey\":\"ts\"}';
CREATE EXTERNAL TABLE stream2 (
event_timestamp TIMESTAMP, attributes MAP<VARCHAR, VARCHAR>,
payload ROW<ts TIMESTAMP, val DOUBLE>)
TYPE pubsub
LOCATION 'projects/sc-9366-nga-dev/topics/marcin-atm22-stream-2'
TBLPROPERTIES '{\"format\": \"json\", \"timestampAttributeKey\":\"ts\"}';
SELECT
CURRENT_TIMESTAMP, s1.window_start, s1.window_end,
s1.samples + COALESCE(s2.samples, 0),
s1.value_avg - COALESCE(s2.value_avg, 0.0)
FROM (
SELECT
TUMBLE_START(event_timestamp, INTERVAL '5' SECOND) AS window_start,
TUMBLE_END(event_timestamp, INTERVAL '5' SECOND) AS window_end,
AVG(stream1.payload.val) AS value_avg,
COUNT(*) AS samples
FROM stream1
GROUP BY TUMBLE(event_timestamp, INTERVAL '5' SECOND)) s1
LEFT OUTER JOIN (
SELECT
TUMBLE_START(event_timestamp, INTERVAL '5' SECOND) AS window_start,
TUMBLE_END(event_timestamp, INTERVAL '5' SECOND) AS window_end,
AVG(stream2.payload.val) AS value_avg,
COUNT(*) AS samples
FROM stream2
GROUP BY TUMBLE(event_timestamp, INTERVAL '5' SECOND)) s2
ON s1.window_start = s2.window_start AND s1.window_end = s2.window_end;
"
job:
createTime: '2022-08-12T10:11:24.261616Z'
currentStateTime: '1970-01-01T00:00:00Z'
id: 2022-08-12_03_11_23-11958657005279157033
location: europe-west1
name: marcin-atm22-gcloud
projectId: sc-9366-nga-dev
startTime: '2022-08-12T10:11:24.261616Z'
Unfortunately, I couldn't manage the deployment using the gcloud command. After a few minutes I got the following Dataflow job error in the StackDriver console:
com.google.cloud.dataflow.sqllauncher.BadTemplateArgumentsException: Error in SQL query: Syntax error: Expected ")" or "," but got "<" [at 2:73]
at com.google.cloud.dataflow.sqllauncher.DataflowSqlLauncher.buildPipeline(DataflowSqlLauncher.java:136)
at com.google.cloud.dataflow.sqllauncher.DataflowSqlLauncher.buildAndRunPipeline(DataflowSqlLauncher.java:108)
at com.google.cloud.dataflow.sqllauncher.DataflowSqlLauncher.main(DataflowSqlLauncher.java:74)
Caused by: com.google.zetasql.SqlException: Syntax error: Expected ")" or "," but got "<" [at 2:73]
at com.google.zetasql.Analyzer.extractTableNamesFromNextStatement(Analyzer.java:251)
at org.apache.beam.sdk.extensions.sql.zetasql.SqlAnalyzer.extractTableNames(SqlAnalyzer.java:64)
at org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLPlannerImpl.rel(ZetaSQLPlannerImpl.java:87)
at org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner.convertToBeamRelInternal(ZetaSQLQueryPlanner.java:198)
at org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner.convertToBeamRel(ZetaSQLQueryPlanner.java:186)
at org.apache.beam.sdk.extensions.sql.impl.BeamSqlEnv.parseQuery(BeamSqlEnv.java:112)
at org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:171)
at org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:110)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:548)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:499)
at org.apache.beam.sdk.values.PBegin.apply(PBegin.java:56)
at org.apache.beam.sdk.Pipeline.apply(Pipeline.java:192)
at com.google.cloud.dataflow.sqllauncher.DataflowSqlLauncher.buildPipelineOrThrow(DataflowSqlLauncher.java:194)
at com.google.cloud.dataflow.sqllauncher.DataflowSqlLauncher.buildPipeline(DataflowSqlLauncher.java:124)
... 2 more
It looks like the query parser isn't able to parse the external table definition for the Pubsub topic. Checkmate! If you know how to run the query with external table definitions using gcloud dataflow sql query, please let me know.

Let's go back to the job deployed on the Dataflow runner from the SQL shell and generate some traffic from the sensors. I prepared a synthetic data generator implemented as a Python script:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
import argparse
import json
from datetime import datetime, timezone, timedelta
from random import random
from time import sleep
from google.cloud import pubsub_v1
def generate_event(delay):
ts = datetime.now(tz=timezone.utc) - timedelta(seconds=delay)
return {
'ts': ts.isoformat(timespec='seconds'),
'val': round(random(), 2)
}
def publish_event(publisher, topic, event):
data = json.dumps(event).encode('utf-8')
publisher.publish(topic, data=data, ts=event['ts'])
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--topic', required=True)
parser.add_argument("--delay", default=0, type=int)
args = parser.parse_args()
publisher = pubsub_v1.PublisherClient()
while True:
event = generate_event(args.delay)
print(event)
publish_event(publisher, args.topic, event)
sleep(1)
Do you remember the timestampAttributeKey property from the Pubsub external table definition? The generator sets the ts attribute for each published message based on event time (line 20). Using this attribute, the Dataflow runner is able to track data freshness in the stream!
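For the record, the non-SQL equivalent looks more or less like the sketch below (my own assumption based on Scio's Pubsub reader): the same ts attribute drives event time, and therefore the watermark, instead of the publish time.

import com.spotify.scio.ScioContext
import com.spotify.scio.values.SCollection

object PubsubEventTimeSketch {
  val sc: ScioContext = ScioContext()

  // read the topic and declare the "ts" attribute as the event timestamp source
  val stream1: SCollection[String] = sc.pubsubTopic[String](
    "projects/sc-9366-nga-dev/topics/marcin-atm22-stream-1",
    timestampAttribute = "ts"
  )
}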
Start data generators for topics stream-1 and stream-2:
$ python generator.py --topic projects/sc-9366-nga-dev/topics/marcin-atm22-stream-1
{'ts': '2022-08-07T10:28:28+00:00', 'val': 0.69}
{'ts': '2022-08-07T10:28:29+00:00', 'val': 0.67}
{'ts': '2022-08-07T10:28:30+00:00', 'val': 0.89}
(...)
$ python generator.py --topic projects/sc-9366-nga-dev/topics/marcin-atm22-stream-2
{'ts': '2022-08-07T10:28:35+00:00', 'val': 0.53}
{'ts': '2022-08-07T10:28:36+00:00', 'val': 0.76}
{'ts': '2022-08-07T10:28:37+00:00', 'val': 0.94}
(...)
Configure Beam SQL to use Direct runner again:
0: BeamSQL> SET runner = DirectRunner;
No rows affected (0.017 seconds)
Try to execute SQL query on Pubsub topics, you should get a decoded sensor payload printed on the console.
0: BeamSQL> SELECT current_timestamp, event_timestamp, stream1.payload.ts, stream1.payload.val FROM stream1 LIMIT 5;
+---------------------+---------------------+---------------------+------+
| current_timestamp | event_timestamp | ts | val |
+---------------------+---------------------+---------------------+------+
| 2022-08-07 10:39:13 | 2022-08-07 10:39:07 | 2022-08-07 10:39:07 | 0.14 |
| 2022-08-07 10:39:13 | 2022-08-07 10:39:05 | 2022-08-07 10:39:05 | 0.2 |
| 2022-08-07 10:39:13 | 2022-08-07 10:39:09 | 2022-08-07 10:39:09 | 0.05 |
| 2022-08-07 10:39:13 | 2022-08-07 10:39:08 | 2022-08-07 10:39:08 | 0.95 |
| 2022-08-07 10:39:13 | 2022-08-07 10:39:01 | 2022-08-07 10:39:01 | 0.63 |
+---------------------+---------------------+---------------------+------+
5 rows selected (26.292 seconds)
0: BeamSQL> SELECT current_timestamp, event_timestamp, stream2.payload.ts, stream2.payload.val FROM stream2 LIMIT 5;
+---------------------+---------------------+---------------------+------+
| current_timestamp | event_timestamp | ts | val |
+---------------------+---------------------+---------------------+------+
| 2022-08-07 10:42:52 | 2022-08-07 10:42:49 | 2022-08-07 10:42:49 | 0.99 |
| 2022-08-07 10:42:52 | 2022-08-07 10:42:50 | 2022-08-07 10:42:50 | 0.36 |
| 2022-08-07 10:42:52 | 2022-08-07 10:42:47 | 2022-08-07 10:42:47 | 0.78 |
| 2022-08-07 10:42:52 | 2022-08-07 10:42:48 | 2022-08-07 10:42:48 | 0.52 |
| 2022-08-07 10:42:52 | 2022-08-07 10:42:40 | 2022-08-07 10:42:40 | 0.72 |
+---------------------+---------------------+---------------------+------+
5 rows selected (16.239 seconds)
Now, it's time to examine the logical graph of the Dataflow job. It should look like the graph presented below:

Dataflow optimizes the logical graph into a physical execution graph. As you can see, some steps are fused together but the overall shape of the graph stays unchanged.

Examine the execution graph for data freshness:

Finally, examine the results generated by the job by querying the BigQuery table:
0: BeamSQL> SELECT
. . . . . > window_start,
. . . . . > window_end,
. . . . . > created_at,
. . . . . > TIMESTAMPDIFF(SECOND, created_at, window_end) AS delay_secs,
. . . . . > samples AS no_of_samples,
. . . . . > ROUND(avg_diff, 2) AS avg_diff
. . . . . > FROM results
. . . . . > ORDER BY window_end DESC
. . . . . > LIMIT 1;
+---------------------+---------------------+---------------------+------------+---------------+----------+
| window_start | window_end | created_at | delay_secs | no_of_samples | avg_diff |
+---------------------+---------------------+---------------------+------------+---------------+----------+
| 2022-08-08 19:07:10 | 2022-08-08 19:07:15 | 2022-08-08 19:07:43 | -28 | 10 | 0.34 |
+---------------------+---------------------+---------------------+------------+---------------+----------+
- The window spans between 19:07:10 (inclusive) and 19:07:15 (exclusive).
- The result of the INSERT INTO results SELECT ... statement was calculated at 19:07:43, 28 seconds after the end of the window.
- The data freshness reported for step F22 is 31 seconds, so the last step takes around 31 - 28 = 3 seconds.

Simulate late data from a phantom sensor. Start a second generator for the stream-2 topic, but with a configured delay of 60 seconds:
$ python generator.py --topic projects/sc-9366-nga-dev/topics/marcin-atm22-stream-2 --delay 60
Examine the execution graph again. Reported data freshness for step F25 is now 1 min 12 secs, but there is no impact on step F27, where data freshness stays unchanged:

Compare the reported watermarks for steps F25 and F27. The watermark for step F25 is around a minute behind the watermark for step F27:
Finally, examine the results generated by the job by querying BigQuery table:
0: BeamSQL> SELECT
. . . . . > window_start,
. . . . . > window_end,
. . . . . > created_at,
. . . . . > TIMESTAMPDIFF(SECOND, created_at, window_end) AS delay_secs,
. . . . . > samples AS no_of_samples,
. . . . . > ROUND(avg_diff, 2) AS avg_diff
. . . . . > FROM results
. . . . . > ORDER BY window_end DESC
. . . . . > LIMIT 1;
+---------------------+---------------------+---------------------+------------+---------------+----------+
| window_start | window_end | created_at | delay_secs | no_of_samples | avg_diff |
+---------------------+---------------------+---------------------+------------+---------------+----------+
| 2022-08-08 20:07:15 | 2022-08-08 20:07:20 | 2022-08-08 20:08:43 | -83 | 15 | 0.21 |
+---------------------+---------------------+---------------------+------------+---------------+----------+
For larger throughput you could also observe that the Dataflow runner slows down consumption from the up-to-date subscription if data in the other stream is late. It doesn't make sense to allocate job resources to data received from up-to-date sensors if the computation has to wait for late data. A streaming runner must offer some kind of back-pressure mechanism to align processing with data freshness.
How is Dataflow able to estimate data freshness if it stops consuming data due to lateness in other sources? Dataflow creates an extra subscription just for event-time tracking. For the presented SQL pipeline there are four subscriptions instead of two!
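You can check this yourself by listing the subscriptions attached to each topic. A minimal sketch with the google-cloud-pubsub client, using the project and topic names from this experiment:
# List the subscriptions attached to both topics to spot the extra tracking subscriptions.
from google.cloud import pubsub_v1  # pip install google-cloud-pubsub

publisher = pubsub_v1.PublisherClient()

for topic in ("marcin-atm22-stream-1", "marcin-atm22-stream-2"):
    topic_path = publisher.topic_path("sc-9366-nga-dev", topic)
    print(topic_path)
    for subscription in publisher.list_topic_subscriptions(request={"topic": topic_path}):
        print("  " + subscription)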
Unfortunately, the tracking subscriptions double the Pubsub message delivery costs. Even if the timestamp attribute alone is enough for tracking purposes, you pay for the delivery of the whole message.
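For context, the timestamp attribute is a regular Pubsub message attribute that Beam uses as the event time instead of the publish time. In the Beam Python SDK it would look roughly like this; the subscription name and the ts attribute are assumptions made for illustration:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    events = pipeline | "ReadStream1" >> beam.io.ReadFromPubSub(
        # hypothetical subscription name, adjust to your own setup
        subscription="projects/sc-9366-nga-dev/subscriptions/marcin-atm22-stream-1",
        # take event time from the "ts" message attribute instead of the publish time
        timestamp_attribute="ts",
    )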
Let's move to the most surprising part of the blog post.
Stop all generators for the stream-2 topic and watch the watermark metric for step F25:
Watermark for stream-2.
Watermark for stream-1.
Data freshness for both streams is the same (~10–20 seconds). If there is no traffic on the Pubsub topic for ~3 minutes, Dataflow assumes that the stream is fully up-to-date and releases the watermark. You can't change this behaviour.
You should observe results similar to the ones below around the moment when the watermark is released:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
+---------------------+---------------------+---------------------+------------+---------------+----------+
| created_at | window_start | window_end | delay_secs | no_of_samples | avg_diff |
+---------------------+---------------------+---------------------+------------+---------------+----------+
| 2022-08-16T07:38:02 | 2022-08-16T07:36:40 | 2022-08-16T07:36:45 | 77 | 15 | -0.16 |
| 2022-08-16T07:38:09 | 2022-08-16T07:36:45 | 2022-08-16T07:36:50 | 79 | 15 | 0.08 |
| 2022-08-16T07:40:07 | 2022-08-16T07:36:50 | 2022-08-16T07:36:55 | 192 | 14 | -0.26 |
| 2022-08-16T07:40:11 | 2022-08-16T07:37:45 | 2022-08-16T07:37:50 | 141 | 10 | 0.18 |
| 2022-08-16T07:40:11 | 2022-08-16T07:37:10 | 2022-08-16T07:37:15 | 176 | 10 | -0.04 |
| 2022-08-16T07:40:11 | 2022-08-16T07:37:40 | 2022-08-16T07:37:45 | 146 | 10 | -0.26 |
| 2022-08-16T07:40:11 | 2022-08-16T07:37:25 | 2022-08-16T07:37:30 | 161 | 10 | -0.18 |
(...)
| 2022-08-16T07:40:14 | 2022-08-16T07:38:30 | 2022-08-16T07:38:35 | 99         | 5             | 0.47     |
| 2022-08-16T07:40:14 | 2022-08-16T07:39:15 | 2022-08-16T07:39:20 | 54         | 5             | 0.31     |
+---------------------+---------------------+---------------------+------------+---------------+----------+
- At 07:38:09 the results for the window [07:36:45, 07:36:50) were calculated based on data from all three sensors.
- At 07:40:07 Dataflow released the watermark. The total delay is about 3 minutes, 192 seconds to be precise.
- At 07:40:11 Dataflow back-filled the gap with data collected from stream-1 and stream-2 while the watermark was held.
- At 07:40:14 there are only 5 samples from stream-1; there is no data in stream-2.
- The result for window [07:37:45, 07:37:50) was produced before the result for window [07:37:10, 07:37:15).
You could be a little lost right now, reasoning about streaming pipelines is a demanding task 😁 Let me show a data visualization for the whole experiment. The first chart shows the pipeline delay, the difference between the results' event time and the processing time. The second chart shows how many samples were used for the calculation.
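The charts below were produced from the contents of the results table. A rough sketch of how you could recreate them with google-cloud-bigquery and matplotlib; the beam_sql dataset name is a placeholder, the column names match the queries above:
# Plot the pipeline delay and the number of samples per window from the results table.
import matplotlib.pyplot as plt
from google.cloud import bigquery  # pip install google-cloud-bigquery matplotlib

client = bigquery.Client(project="sc-9366-nga-dev")
rows = client.query(
    """
    SELECT window_end,
           TIMESTAMP_DIFF(created_at, window_end, SECOND) AS delay_secs,
           samples
    FROM `beam_sql.results`  -- beam_sql is a placeholder dataset name
    ORDER BY window_end
    """
).result()

window_ends, delays, samples = [], [], []
for row in rows:
    window_ends.append(row.window_end)
    delays.append(row.delay_secs)
    samples.append(row.samples)

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.plot(window_ends, delays)
ax1.set_ylabel("delay [s]")  # processing time minus event time
ax2.plot(window_ends, samples)
ax2.set_ylabel("samples per window")
ax2.set_xlabel("window end (event time)")
plt.show()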
The sleep(1) call between samples isn't enough to generate exactly one event per second.
stream-2 back-fill – part 1:
stream-2 back-fill – part 2:
The timeline in the processing-time domain is also interesting and shows when results are produced:
If not specified explicitly, Apache Beam SQL executes the job on a single n1-standard-4 worker.
Remember to stop the job after experimentation to avoid unnecessary costs.
1
2
$ gcloud dataflow jobs --project=sc-9366-nga-dev cancel --region=europe-west1 2022-08-07_04_04_14-1079245398928473428
Cancelled job [2022-08-07_04_04_14-1079245398928473428]
Below you can find a short summary of the blog post and my experimentation with Apache Beam SQL on the Dataflow runner. I've tried to show some advantages, but also point out the challenges you will face if you want to run a SQL streaming pipeline.
In the execution graph you only see generated logical steps like BeamCoGBKJoinRel_592 or physical steps like F23.
I would recommend using the regular Apache Beam SDK (Java, Python or Go) if you want to run streaming pipelines. Apache Beam SQL looks promising, but there is still a long way to go before the first production deployment is managed fully by a SQL analyst.