
Building Big Data Pipelines with Apache Beam

In this task, we will see how to implement all aspects of a splittable DoFn process and how to use its power and extensibility. So, let's create a streaming source from a plain filesystem! The following problem definition explains what we mean by that.
We want to create a streaming-like source that watches a specified directory on a filesystem for new files. Once a new file appears, the source will pick it up and output its content split into individual (text) lines for downstream processing. The source will compute its watermark as the maximum timestamp over all of the files in the specified directory. For simplicity, we will ignore sub-directories (no recursion) and treat all files as immutable.
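To make the problem definition more concrete, here is a minimal sketch in plain Java of what a single polling round over the directory might look like. It is not the splittable DoFn itself and uses no Beam classes. The class name `DirectoryPollSketch` and its methods are hypothetical, and taking the file modification time as the file's timestamp is an assumption made only for this illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration only; not part of Apache Beam.
class DirectoryPollSketch {

  // Paths already handed out. Files are treated as immutable,
  // so each path needs to be processed only once.
  private final Set<Path> seen = new HashSet<>();

  /** Returns regular files directly under dir (no recursion) not returned before. */
  List<Path> pollNewFiles(Path dir) throws IOException {
    List<Path> fresh = new ArrayList<>();
    try (Stream<Path> entries = Files.list(dir)) {
      entries.filter(Files::isRegularFile).sorted().forEach(p -> {
        if (seen.add(p)) {
          fresh.add(p);
        }
      });
    }
    return fresh;
  }

  /** Splits the content of a file into individual text lines (UTF-8 assumed). */
  static List<String> readLines(Path file) throws IOException {
    return Files.readAllLines(file, StandardCharsets.UTF_8);
  }

  /**
   * Watermark as the maximum timestamp over all files in dir.
   * Assumption: the modification time serves as the file timestamp.
   * Returns Instant.MIN when the directory contains no regular files.
   */
  static Instant watermark(Path dir) throws IOException {
    List<Path> files;
    try (Stream<Path> entries = Files.list(dir)) {
      files = entries.filter(Files::isRegularFile).collect(Collectors.toList());
    }
    Instant max = Instant.MIN;
    for (Path p : files) {
      Instant t = Files.getLastModifiedTime(p).toInstant();
      if (t.isAfter(max)) {
        max = t;
      }
    }
    return max;
  }
}
```

A caller would invoke `pollNewFiles` repeatedly, pass each returned path to `readLines`, and call `watermark` after each round. The sketch keeps the set of processed files in memory, which a real source could not rely on across restarts.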
Let's illustrate that in the following discussion for clarity.
The problem effectively consists...