
Mastering Hadoop

Filtering inputs to a job based on certain attributes is often required. Data-level filtering can be done within the Maps, but it is more efficient to filter at the file level, before the Map task is spawned. Filtering ensures that only interesting files are processed by Map tasks, and it can have a positive effect on Map runtimes by eliminating unnecessary file fetches. For example, only files generated within a certain time period might be required for analysis.
Let's use the 441-grant proposal file corpus subset to illustrate filtering. We will process only those files whose names match a particular regular expression and that have a minimum file size. Both criteria are specified as job parameters: filter.name and filter.min.size, respectively. Implementation entails extending the Configured class and implementing the PathFilter interface, as shown in the following snippet. The Configured class is the base class for things that can be configured using a Configuration object.
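A filter along these lines can be sketched as follows. This is a minimal illustration, not the book's exact listing: the class name FileFilter, the default values, and the RuntimeException handling are illustrative choices. The filter reads filter.name and filter.min.size from the job Configuration in setConf(), and accept() applies both criteria. Directories are always accepted so that input listing can descend into them.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class FileFilter extends Configured implements PathFilter {

    private FileSystem fs;
    private String filterName; // regular expression from filter.name
    private long minSize;      // minimum size in bytes from filter.min.size

    @Override
    public void setConf(Configuration conf) {
        super.setConf(conf);
        if (conf != null) {
            // Defaults (match everything, no size limit) are illustrative.
            filterName = conf.get("filter.name", ".*");
            minSize = conf.getLong("filter.min.size", 0L);
            try {
                fs = FileSystem.get(conf);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }

    @Override
    public boolean accept(Path path) {
        try {
            FileStatus status = fs.getFileStatus(path);
            // Accept directories so the input listing can recurse into them.
            if (status.isDirectory()) {
                return true;
            }
            // Accept only files matching the name pattern and size threshold.
            return path.getName().matches(filterName)
                    && status.getLen() >= minSize;
        } catch (IOException e) {
            return false;
        }
    }
}
```

The filter is registered on the job with FileInputFormat.setInputPathFilter(job, FileFilter.class); the framework then instantiates it via reflection and, because the class implements Configurable (through Configured), injects the job Configuration by calling setConf() before any accept() calls are made.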