
Cloud Scale Analytics with Azure Data Services
There are several definitions of Data Warehousing on the internet. The narrower ones characterize a warehouse as just the database and the model used within it; the wider ones treat the term as a method and as the full collection of organizational and technological components that make up a BI solution. They cover everything from the Extract, Transform, Load (ETL) tool to the database, the model, and, of course, the reporting and dashboarding solution.
When we look at the Data Warehousing method in general, at its heart, we find a database that offers a certain table structure. We almost always find two main types of artifacts in the database: Facts and Dimensions.
Facts provide all the measurable information that we want to analyze; for example, the quantities of products sold per customer, per region, per sales representative, and per time period. Facts are normally quite narrow objects, but they store a lot of rows.
In the Dimensions, we find all the descriptive information that can be linked to the Facts for analysis. Every piece of information that a user puts on their report or dashboard to aggregate, group, filter, and view the fact data is collected in the Dimensions. All the descriptive data, such as Customer, Product, Contract, and Address information, that might need to be analyzed and correlated is stored here. Typically, these objects are stored as tables in the database and are joined using their key columns. Dimensions are normally wide objects, sometimes with controlled redundancy, depending on the modeling method that is used.
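To make the Fact/Dimension pattern concrete, here is a minimal sketch in PySpark using Spark SQL. The table and column names (FactSales, DimCustomer, DimDate, and so on) are purely illustrative assumptions, not taken from any specific warehouse model:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("star-schema-sketch").getOrCreate()

# Small in-memory stand-ins for a fact table and two dimensions.
fact_sales = spark.createDataFrame(
    [(1, 100, 20240101, 3, 29.97), (2, 101, 20240101, 1, 9.99)],
    ["SalesId", "CustomerKey", "DateKey", "Quantity", "Amount"],
)
dim_customer = spark.createDataFrame(
    [(100, "Contoso Ltd", "West"), (101, "Fabrikam Inc", "East")],
    ["CustomerKey", "CustomerName", "Region"],
)
dim_date = spark.createDataFrame([(20240101, 2024, 1)], ["DateKey", "Year", "Month"])

fact_sales.createOrReplaceTempView("FactSales")
dim_customer.createOrReplaceTempView("DimCustomer")
dim_date.createOrReplaceTempView("DimDate")

# A typical analytical question: quantities and amounts sold per region and month.
spark.sql("""
    SELECT c.Region, d.Year, d.Month,
           SUM(f.Quantity) AS TotalQuantity,
           SUM(f.Amount)   AS TotalAmount
    FROM FactSales f
    JOIN DimCustomer c ON f.CustomerKey = c.CustomerKey
    JOIN DimDate     d ON f.DateKey     = d.DateKey
    GROUP BY c.Region, d.Year, d.Month
""").show()
```

The narrow-but-deep fact table carries the measures, while the wide dimensions carry the descriptive attributes used for grouping and filtering.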
Three main methods for modeling the Facts and Dimensions within a Data Warehouse database have crystallized over the years of its evolution:
But how does data finally land in the Data Warehouse database? The process and the related tools are named Extract, Transform, Load (ETL) but, depending on the sequence in which the steps are implemented, the process may also be referred to as ELT. You'll find several possible ways to implement data loading into a Data Warehouse. One is to use specialized ETL tools such as Azure Data Factory in the cloud, SQL Server Integration Services (SSIS), Informatica, Talend, or IBM DataStage, for example.
The biggest advantage of these tools is the availability of wide catalogues of ready-to-use source and target connectors. They can connect directly to a source, query the needed data, and even transform it while it is being transported to the target. In the end, the data is loaded into the Data Warehouse database. Another advantage is their graphical interfaces, where complex logic can be implemented on a "point-and-click" basis, which is very easy to understand and maintain.
There are other options as well. Data is often pushed by source applications, and a direct connection for data extraction is not wanted at all. Many times, files are provided that are stored somewhere near the Data Warehouse database and then need to be imported. Maybe there is no ETL tool available. Since nearly every database nowadays provides loader tools, the import can be accomplished using those tools in a scripted environment. Once the data has made its way to the database tables, the transformational steps are done using Stored Procedures that move the data through the different DWH stages or layers to the final Core Data Warehouse.
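A minimal sketch of such a scripted load, assuming a SQL Server target and the pyodbc driver, might look as follows. The server, file path, staging table, and stored procedure names (stage.SalesRaw, dbo.LoadCoreDwh) are hypothetical placeholders:

```python
import pyodbc

# Connection details are placeholders; in practice they would come from a
# secure configuration store.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=dwh;"
    "UID=loader;PWD=<secret>"
)
cursor = conn.cursor()

# Step 1: use the database's own loader facility to import the delivered
# file into a staging table.
cursor.execute(r"""
    BULK INSERT stage.SalesRaw
    FROM '/data/incoming/sales_20240101.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2)
""")

# Step 2: let a stored procedure move the data through the DWH layers
# (cleansing, core) once it has landed in the staging tables.
cursor.execute("EXEC dbo.LoadCoreDwh")
conn.commit()
conn.close()
```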
Talking about the Data Warehouse layers, we nearly always find several steps that are processed before the data is provided for reporting or dashboarding. Typically, there are at least the following stages or layers:
Data Warehouses have also brought in other types of layers. There might be sublayers within the cleansing area, for example, or Data Marts, which are used to slice data semantically for the needs of certain user groups, or an Operational Data Store (ODS), which has different definitions, depending on who you ask. Sometimes, it is used as an intermediary data store that can also be used for reporting. Other definitions speak of a transient zone that stores data for integration and further transport to the DWH. Sometimes, it is used as a kind of archived data pool from which the Data Warehouse can always be reproduced. The definitions vary.
In terms of the reporting/dashboarding solution, we will also find several approaches to visualize the information that's provided by the Data Warehouse. There are the typical reporting tools, such as SQL Server Reporting Service or Crystal Reports, for example. These are page-oriented report generators that access the underlying database, get the data, and render it according to the template being used.
The more modern approach, however, has resulted in tools that can access a certain dataset, store it in an internal caching database, and allow interactive reporting and dashboarding based on that data. Self-Service BI tools such as Power BI, QLIK, and Tableau allow you to access the data, create your own visuals, and put them together in dashboards. Nowadays, these tools can even correlate data from different data sources and allow you to analyze data that might not have made it to the Data Warehouse database yet.
You can scale databases with newer, faster hardware, more memory, and faster disks. You can also go from Symmetric Multi-Processing (SMP) to Massively Parallel Processing (MPP) databases. However, the usual restrictions still apply: the more data we need to process, the longer it will take to do so. And there are also workloads that databases will not support, such as image processing.
It's funny that we find a similar mix, or maybe even call it confusion, when we examine the term Data Lake. Many people refer to a Data Lake as a Hadoop Big Data implementation that delivers one or more clusters of computers on which a distributed filesystem and computational software are installed. It can handle distributed Input and Output (I/O) on the one hand, but it can also perform distributed and parallel computation. Such a system adds specialized services for all kinds of workloads, be it SQL queries against the stored data, in-memory computation, streaming analytics – you name it. Interestingly, just as in the Data Warehouse discussion, the narrower definition of the Data Lake refers only to the storage solution, and less to the method and the collection of services. To be honest, I like the wider definition far better, as it describes the method more holistically.
With the Hadoop Distributed File System (HDFS) from the Apache Software Foundation, we get a system that is capable of storing data distributed over clusters of cheap commodity hardware. HDFS splits files into blocks and also replicates those blocks within the system. This not only delivers a failsafe environment, but also creates the biggest advantage of HDFS: parallel access to the data blocks for the consuming compute components. This means that a Spark cluster, for example, can read several blocks of a file in parallel, with several threads, and therefore at increased speed.
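A small sketch, assuming a reachable HDFS namenode and a Spark session, shows how this block-level parallelism surfaces in practice: Spark maps HDFS blocks (or splits of them) to partitions that separate tasks can read concurrently. The path used here is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-partitions").getOrCreate()

# Read a folder of files that HDFS has split into blocks across the cluster.
events = spark.read.text("hdfs://namenode:8020/data/events/")

# Each partition roughly corresponds to an HDFS block, and each partition can
# be processed by a separate task in parallel.
print("Number of partitions:", events.rdd.getNumPartitions())
print("Total rows:", events.count())
```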
With the possibility of handing files to a filesystem and starting to analyze them right where they are, we enter another paradigm.
MapReduce, the programming paradigm, supports the parallel processing of many files in this distributed environment. Every participating node runs calculations over a certain subset of all the files that must be analyzed. In the end, the results are aggregated by a driver node and returned to the querying instance. In contrast to the Data Warehouse, we can start analyzing data on a Schema-On-Read basis. This means we decide on the structure of the files that are processed right when they are being processed. The Data Warehouse, in comparison, is based on a Schema-On-Write strategy, where the tables must be defined before the data is loaded into the database.
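The following is a hedged MapReduce-style sketch using Spark's RDD API: every worker counts events per type within its own subset of the files, the partial results are merged per key, and only the small aggregate travels back to the driver. The input path and the assumption that each line starts with an event type are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mapreduce-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs://namenode:8020/data/events/")

counts = (
    lines.map(lambda line: (line.split(",")[0], 1))  # "map": emit (event_type, 1)
         .reduceByKey(lambda a, b: a + b)            # "reduce": sum per key on the workers
)

# Only the aggregated result is collected on the driver and returned.
print(counts.collect())
```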
In the Data Lake world, you, as the developer, do not need to plan structures weeks ahead, as you would in the Data Warehouse. Imagine you have stored a vast quantity of files, where each row in a file reflects an event and consists of, let's say, 100 columns. Using Schema-On-Read, you can simply decide to count only the rows of all the files; you won't need to cut the rows into columns for this. For other purposes, you might need to split the rows into their columns during the reading process to access the detailed information, for example, to predict a machine failure based on the content of the columns. Using a database, you can mimic this behavior, but you would need to create different objects and selectively store data for each purpose.
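Schema-on-read in a nutshell, as a sketch with hypothetical paths and made-up column names: the same files can be read without any structure just to count rows, or with a schema applied at read time when the individual columns are actually needed:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

path = "hdfs://namenode:8020/data/machine-events/"

# 1) No schema at all: treat every line as a single string and count it.
print("Row count:", spark.read.text(path).count())

# 2) Apply a schema only when the columns are needed, for example to build a
#    feature set for failure prediction (column names are assumptions).
schema = StructType([
    StructField("machine_id", StringType()),
    StructField("sensor", StringType()),
    StructField("value", DoubleType()),
])
events = spark.read.csv(path, schema=schema)
events.filter(events.sensor == "temperature").show(5)
```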
HDFS is the center of a Data Lake system. But we can't get very far with just a filesystem, even one as sophisticated as this. The open source Hadoop world around HDFS has therefore developed many services to interact with and process the content that's kept in distributed storage. There are services such as Hive, a SQL-like Data Warehousing service; Pig, for highly parallel analysis jobs; Spark, a large-scale, in-memory analytical engine; Storm, a distributed real-time streaming analysis engine; Mahout, for machine learning; HBase, a database; Oozie, for workflows; and many more.
Spark clusters, with their ability to run distributed, in-memory analytical processes on distributed datasets, have influenced the market heavily in recent years. Scala, one of the main programming languages used with Spark, runs on the Java Virtual Machine. It can be used to write high-performance routines for ETL/ELT, data cleansing, and transformation, but also for machine learning and artificial intelligence on massive datasets. Python has also made its way into this technology and is one of the go-to languages here. R, as the statistical programming language of choice for many data scientists, was also a must for the Spark offering. And, as dealing with data has been important for so many years, SQL is, of course, a language that could not be skipped in such an environment.
These services and languages make the open source Hadoop ecosystem a rich, complex engine for processing and analyzing massive amounts of data. They are available across the clusters and can interact with HDFS to make use of the distributed files and compute resources.
Tip
Jumping into a big data challenge should, just like a Data Warehouse project, always be a well-prepared and intensively examined undertaking. Starting it from an "I want that too!" position is the worst driver you can have, and unfortunately this happens far too often. When you start with a purpose and find the right tool for that purpose, that selection alone might create the right environment for a successful project. Maybe we will find some alternatives for you throughout this book.
Very similar to the Data Warehouse layers we discussed previously, a Data Lake is also structured into different layers. These layers form the stages that the data must go through on its way to being turned into information. The three major zones are pretty much comparable. We can form a landing zone in the Data Lake, where data is written "as-is", in the exact format in which it comes from the source. The transient zone then compares to the cleansing area of the Data Warehouse: data is transformed, wrangled, and massaged to suit the requirements of the analyzing users. As no one should access either the landing zone or the transient zone, the curated zone is where you're heading, and what you, as the developer, will allow your users to access. Other, similar concepts talk about Bronze, Silver, and Gold areas, or use other terms. Just like in the Data Warehouse concept, a Data Lake can include additional zones, for example, zones for master data, user sandboxes, or logging areas for your routines. We can also have several user- or group-related folder structures in the curated zone where data is aggregated, just like the Data Marts of the DWH.
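A minimal sketch of moving data through such zones with PySpark might look as follows. The zone paths, column names, and cleansing rules are assumptions for illustration, not a prescribed layout:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-zones").getOrCreate()

landing = "hdfs://namenode:8020/lake/landing/sales/2024-01-01/"
curated = "hdfs://namenode:8020/lake/curated/sales/"

# Landing zone: files exactly as delivered by the source.
raw = spark.read.csv(landing, header=True)

# Transient-zone work: cleanse, type, and deduplicate the data.
clean = (
    raw.dropDuplicates(["order_id"])
       .withColumn("amount", F.col("amount").cast("double"))
       .filter(F.col("amount").isNotNull())
)

# Curated zone: store in an analysis-friendly format for the consuming users.
clean.write.mode("overwrite").parquet(curated)
```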
So, yes, you can be flexible and agile. But in a Data Lake with Schema-On-Read, you'll still need to decide which attributes you want to analyze or which ones your machine learning models need for training. This will, after all, require you to structure your sources and will therefore push you into a certain project life cycle. You will go through user stories, requirements analysis, structuring and development, versioning, and delivering artifacts.
If you're only analyzing tabular-oriented data, it may be worth checking whether your hardware, your ETL tool, and your company databases can scale to the needed volume.
This question, along with the nature, complexity, and format of the source data, should be taken into account when you're deciding on the technology to use. If you need to process sound files, images, PDFs, or similar, then this is no job for a database; SQL does not offer language elements for this (although you can add programming extensions to your database). Here, we have a clear marker for the possible use of technologies other than databases.
Once you have analyzed this so-called unstructured data, you will structure the results back into a tabular-oriented result set, an array of data, that needs to be delivered somehow to the recipients of your analysis. But how is that done in a Data Lake?
The typical reporting tools still require tables or table-like source data to work with. Often, they import the source data into their internal database so that they can quickly answer the reporting and dashboarding operations of their users. Experience with these visualization tools has shown that it is not very performant to report directly from a vast number of files in a Data Lake. Data always needs to be condensed into digestible chunks for these purposes and should then be stored in a suitable and accessible way.
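As a hedged sketch of such condensing, the following aggregates curated lake data into a small, reporting-friendly dataset that a BI tool could import or query. The paths and column names are illustrative only:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("reporting-extract").getOrCreate()

sales = spark.read.parquet("hdfs://namenode:8020/lake/curated/sales/")

# Pre-aggregate to a daily summary per region.
daily_summary = (
    sales.groupBy("region", "order_date")
         .agg(F.sum("amount").alias("total_amount"),
              F.count(F.lit(1)).alias("order_count"))
)

# A compact, pre-aggregated dataset is far easier for Power BI, QLIK, or
# Tableau to consume than the raw files themselves.
daily_summary.write.mode("overwrite").parquet(
    "hdfs://namenode:8020/lake/curated/reporting/daily_sales_summary/"
)
```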
Funnily enough, many of the services in the Hadoop ecosystem have been equipped with functionality similar to what databases have offered, and been optimized for, for ages now. Data Warehouse databases are still more mature in many respects: when we look at Hive as the Data Warehouse service in Hadoop, for example, it has long lacked the ability to update data or to run certain nested subqueries in HiveQL. But over the years, it has been extended with all kinds of database-like functionality (views, indexes, and many more).
Important Note
The Data Lake approaches are still missing a fine-grained security model that delivers centralized mechanisms to control Row-Level Security and Column-Level Security over the data while also offering an easy-to-implement mechanism for Data Masking.