
Building Big Data Pipelines with Apache Beam

The name PCollection is an abbreviation of parallel collection, which suggests a collection of elements that is distributed in some way among multiple workers. And that is exactly what a parallel collection is. To be able to move the individual elements of this collection between workers, each element needs a serialized representation. That is, we need a piece of code that takes a raw (in-memory) object and produces a byte representation that can be sent over the wire. On the receiving side, we need another piece of code that takes this byte representation and recreates the original in-memory object (or rather, a copy of the original). And that is exactly what coders are for.
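To make the encode and decode halves concrete, here is a minimal sketch in plain Java. It does not use Beam's Coder class: the SimpleCoder interface, the Utf8StringCoder class, and the length-prefixed wire format are all hypothetical, chosen only to illustrate the round trip from an in-memory object to bytes and back.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class CoderSketch {
  public static void main(String[] args) {
    SimpleCoder<String> coder = new Utf8StringCoder();
    String original = "hello beam";

    // "Sending" side: turn the in-memory object into bytes.
    byte[] wire = coder.encode(original);
    // "Receiving" side: rebuild an object from those bytes.
    String copy = coder.decode(wire);

    System.out.println(copy.equals(original)); // true: same content
    System.out.println(copy == original);      // false: a copy, not the same object
  }
}

// Hypothetical, simplified stand-in for a coder; this is NOT Beam's Coder API.
interface SimpleCoder<T> {
  byte[] encode(T value);

  T decode(byte[] bytes);
}

// Assumed wire format: a 4-byte length prefix followed by the UTF-8 bytes.
final class Utf8StringCoder implements SimpleCoder<String> {
  @Override
  public byte[] encode(String value) {
    byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
    return ByteBuffer.allocate(Integer.BYTES + utf8.length)
        .putInt(utf8.length)
        .put(utf8)
        .array();
  }

  @Override
  public String decode(byte[] bytes) {
    ByteBuffer buffer = ByteBuffer.wrap(bytes);
    byte[] utf8 = new byte[buffer.getInt()];
    buffer.get(utf8);
    return new String(utf8, StandardCharsets.UTF_8);
  }
}
```

The decoded value equals the original but is a distinct object, which is the "copied version of the original" mentioned above. A real coder has to handle more than this sketch does, so treat it as an illustration of the idea rather than of Beam's implementation.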
We have already used Coder in our test cases. Recall how we constructed our TestStream object:
TestStream.create(StringUtf8Coder.of())
The reason we need to specify a Coder object here is that every PCollection...