
Actionable Insights with Amazon QuickSight
In this section, we will go through a hands-on example to create a basic data Lake House architecture. This tutorial will use the AWS CLI and the AWS console. By the end of this section, we will have spun up a working data lake and a data warehouse environment with demo data loaded.
Important note
The resources for this tutorial might introduce charges to your AWS account. Once you finish with the exercise, make sure you clean up the resources to prevent incurring further charges.
In this first step, we will add the data lake storage. Then we will upload a demo dataset and discover its schema automatically.
Let's begin:
First, let's create the S3 bucket that will serve as our data lake storage. S3 bucket names must be globally unique, so we will append a random string, for example SF482XHS7M, to the bucket name. Let's use data-lake-<random string> as the bucket name:
% aws s3api create-bucket --bucket data-lake-xxxxxxxxxx --region us-east-1
And the response should look like this:
{ "Location": "/data-lake-xxxxxxxxxx" }
Now let's add some data. For this exercise, we will use a subset of the New York City Taxi and Limousine Commission (TLC) Trip Record Data:
% aws s3 ls 's3://nyc-tlc/trip data/' --no-sign-request
This command will return all the files in the open S3 location:
2016-08-11 15:32:21   85733063 fhv_tripdata_2015-01.csv
2016-08-11 15:33:04   97863482 fhv_tripdata_2015-02.csv
2016-08-11 15:33:40  102220197 fhv_tripdata_2015-03.csv
…
2021-02-26 16:54:00  138989555 yellow_tripdata_2020-11.csv
2021-02-26 16:54:00  134481400 yellow_tripdata_2020-12.csv
We don't need to download all of them. For this tutorial, we will copy only the files for 2020, using the --exclude and --include parameters to apply a pattern that matches the desired filenames. The command to copy is as follows:
% aws s3 cp "s3://nyc-tlc/trip data/" s3://data-lake-xxxxxxxxxx/yellowtrips/ --recursive --exclude "*" --include "yellow_tripdata_2020*"
We can verify that the files were copied with the aws s3 ls command, which lists the files under a specific S3 location:
% aws s3 ls s3://data-lake-xxxxxxxxxx/yellowtrips/
2021-03-27 16:53:41  593610736 yellow_tripdata_2020-01.csv
2021-03-27 16:53:41  584190585 yellow_tripdata_2020-02.csv
2021-03-27 16:53:42  278288608 yellow_tripdata_2020-03.csv
2021-03-27 16:53:41   21662261 yellow_tripdata_2020-04.csv
2021-03-27 16:53:43   31641590 yellow_tripdata_2020-05.csv
2021-03-27 16:53:42   50277193 yellow_tripdata_2020-06.csv
2021-03-27 16:53:44   73326707 yellow_tripdata_2020-07.csv
2021-03-27 16:53:46   92411545 yellow_tripdata_2020-08.csv
2021-03-27 16:53:50  123394595 yellow_tripdata_2020-09.csv
2021-03-27 16:53:54  154917592 yellow_tripdata_2020-10.csv
2021-03-27 16:53:57  138989555 yellow_tripdata_2020-11.csv
2021-03-27 16:53:58  134481400 yellow_tripdata_2020-12.csv
Note
You can use data in a shared data lake as part of your own data lake without needing to copy it across.
The next step is to identify the schema of the dataset. For this purpose, we will use the AWS Glue crawlers. AWS Glue crawlers crawl through the data to detect the schema. If a schema can be determined (remember there is no guarantee that the data has a specific schema) then Glue crawlers will populate the Glue Catalog with the schemas identified after crawling the data. Glue tables always belong to a Glue database. A database in Glue is just a logical repository of tables in the Glue Catalog:
Let's first create a Glue database using the create-database command:
% aws glue create-database --database-input "{\"Name\":\"my-data-lake-db\"}" --region us-east-1
We can confirm that the database was created with the get-databases command:
% aws glue get-databases --region us-east-1
{
    "DatabaseList": [
        {
            "Name": "default",
            "CreateTime": 1553517157.0
        },
        {
            "Name": "my-data-lake-db",
            "CreateTime": 1616865129.0
        }
    ]
}
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:GetObject", "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::data-lake-xxxxxxxxxx", "arn:aws:s3:::data-lake-xxxxxxxxxx/*" ] } ] }
The preceding policy document allows the policy holder to use the S3 ListBucket and GetObject APIs. The crawler will use ListBucket to list the objects in our data lake bucket and GetObject to read objects as it crawls the data. This policy restricts access to the data lake bucket only.
Save the policy document into a file named policy. I used vim, but you can use any text editor:
% vim policy
Next, create the policy using the create-policy CLI command:
% aws iam create-policy --policy-name DataLakeReadAccess --policy-document file://policy
The preceding command created the policy and we should get a confirmation JSON object back. Note the policy ARN, as we will use it in a later step.
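If you missed the ARN in the output, one way to look it up again is to filter the list-policies output by the policy name (shown here as a sketch, assuming the DataLakeReadAccess name used above):
% aws iam list-policies --scope Local \
    --query "Policies[?PolicyName=='DataLakeReadAccess'].Arn" --output text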
{ "Version": "2012-10-17", "Statement": [ { "Action": "sts:AssumeRole", "Effect": "Allow", "Principal": { "Service": "glue.amazonaws.com" } } ] }
Create a file named role-policy and copy in the preceding JSON document:
% vim role-policy
This role policy document allows the Glue service to assume the role we will create.
Now, create the role using the iam create-role CLI command:
% aws iam create-role --role-name GlueCrawlerRole --assume-role-policy-document file://role-policy
We should get a confirmation JSON message after running the command.
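We will need the role ARN when we create the crawler later. If you want to retrieve it at any point, a quick sketch using the iam get-role command:
% aws iam get-role --role-name GlueCrawlerRole --query 'Role.Arn' --output text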
Next, we need to attach two policies to the role: the AWSGlueServiceRole policy (which is managed by AWS) and the DataLakeReadAccess policy we created earlier. To attach policies to the IAM role, we will use the iam attach-role-policy command. Let's start with the AWSGlueServiceRole policy:
% aws iam attach-role-policy --role-name GlueCrawlerRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
Then, let's attach the DataLakeReadAccess policy. We will need the policy ARN that we captured earlier. The policy ARN should look like the following line:
arn:aws:iam::<ACCOUNT-ID>:policy/DataLakeReadAccess
And the command should look like the following:
% aws iam attach-role-policy --role-name GlueCrawlerRole --policy-arn arn:aws:iam::<ACCOUNT-ID>:policy/DataLakeReadAccess
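Optionally, you can confirm that both policies are now attached to the role by listing them:
% aws iam list-attached-role-policies --role-name GlueCrawlerRole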
Now we can create the crawler using the glue create-crawler CLI command. Make sure you replace the role ARN and the data lake location with the values for your environment:
% aws glue create-crawler --name qs-book-crawler --role arn:aws:iam::xxxxxxxxxxxx:role/GlueCrawlerRole --database-name my-data-lake-db --targets "{\"S3Targets\":[{\"Path\":\"s3://data-lake-xxxxxxxxxx/yellowtrips\"}]}" --region us-east-1
Then, start the crawler using the glue start-crawler command:
% aws glue start-crawler --name qs-book-crawler --region us-east-1
After 1-2 minutes, the Glue crawler should populate the database.
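If you would rather watch the crawler's progress than wait, you can poll its state with the glue get-crawler command; it reports RUNNING while crawling and READY once it has finished:
% aws glue get-crawler --name qs-book-crawler \
    --query 'Crawler.State' --output text --region us-east-1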
We can confirm that the table was created using the glue get-tables CLI command:
% aws glue get-tables --database-name my-data-lake-db
Alternatively, we can view the new table in the AWS Glue console. Open the Glue console, navigate to Databases, and choose my-data-lake-db. Then click on View tables under my-data-lake-db. It should look like the following screenshot:
Figure 1.5 – Glue console
Tip
You can click the checkbox to select the table and then, under Action, you can choose Preview Data. This will open the Amazon Athena console and run an Athena query that returns 10 values from the table.
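You can run a similar preview query from the CLI as well. The following sketch assumes the crawler registered the table as yellowtrips and uses an athena-results/ prefix in the data lake bucket as the query output location, which you can change to any S3 path you prefer:
% aws athena start-query-execution \
    --query-string "SELECT * FROM yellowtrips LIMIT 10" \
    --query-execution-context Database=my-data-lake-db \
    --result-configuration OutputLocation=s3://data-lake-xxxxxxxxxx/athena-results/ \
    --region us-east-1
The command returns a QueryExecutionId, which you can pass to aws athena get-query-results to fetch the rows.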
Let's create our data warehouse next.
To create the data warehouse, we will use the redshift create-cluster
CLI command, or you can use the AWS Console:
% aws redshift create-cluster --node-type dc2.large --number-of-nodes 2 --master-username admin --master-user-password R3dsh1ft --cluster-identifier mycluster --region us-east-1
This command should give a response with the cluster metadata. After a few minutes, our cluster will be up and running.
Note
This command will create a Redshift cluster with a public IP address. This is something that should be avoided in real-world scenarios. The instructions provided are oversimplified for the purposes of this tutorial as this book is not focused on Amazon Redshift.
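Cluster creation takes a few minutes. To check whether the cluster is ready, you can query its status; it changes from creating to available:
% aws redshift describe-clusters --cluster-identifier mycluster \
    --query 'Clusters[0].ClusterStatus' --output text --region us-east-1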
First, let's create an IAM role that we will assign to the Redshift cluster. We will use this role when using the Redshift Spectrum feature to query data in S3. We want the cluster to be able to write and read to our S3 location. We also want the cluster to be able to have read access to the Glue Catalog:
{ "Version": "2012-10-17", "Statement": [ { "Action": "sts:AssumeRole", "Effect": "Allow", "Principal": { "Service": "redshift.amazonaws.com" } } ] }
Create a file named role-policy-redshift and copy in the preceding JSON document:
% vim role-policy-redshift
Then, create the role using the iam create-role command:
% aws iam create-role --role-name RedshiftSpectrumRole --assume-role-policy-document file://role-policy-redshift
Note the role ARN, as we will use it later to attach it to the cluster.
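As before, if you need to look the role ARN up again, you can use the iam get-role command:
% aws iam get-role --role-name RedshiftSpectrumRole --query 'Role.Arn' --output text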
Next, let's attach the AmazonS3FullAccess and AWSGlueConsoleFullAccess managed policies to the role:
% aws iam attach-role-policy --role-name RedshiftSpectrumRole --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
% aws iam attach-role-policy --role-name RedshiftSpectrumRole --policy-arn arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess
Finally, let's attach the role to the Redshift cluster using the redshift modify-cluster-iam-roles CLI command. Note that you need to update the role ARN with the value from your environment:
% aws redshift modify-cluster-iam-roles --cluster-identifier mycluster --add-iam-roles arn:aws:iam::<ACCOUNT-ID>:role/RedshiftSpectrumRole --region us-east-1
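You can verify the association with describe-clusters; each attached role is listed with an apply status that moves from adding to in-sync once the role is fully attached:
% aws redshift describe-clusters --cluster-identifier mycluster \
    --query 'Clusters[0].IamRoles' --region us-east-1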
Now, let's connect to the cluster using the Redshift query editor in the AWS console. Use dev as the database name and admin as the username.
Note
We didn't set a database name earlier. Redshift uses dev as the default value.
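If you prefer a SQL client over the query editor, you can also connect with psql, since Redshift speaks the PostgreSQL wire protocol. This is only a sketch; it assumes the psql client is installed and that the cluster's security group allows inbound traffic on port 5439 from your IP address:
% aws redshift describe-clusters --cluster-identifier mycluster \
    --query 'Clusters[0].Endpoint.Address' --output text --region us-east-1
% psql -h <cluster-endpoint> -p 5439 -U admin -d dev
The first command prints the cluster endpoint; substitute it for <cluster-endpoint> in the second command.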
Next, let's create a table to store the most recent taxi trip data. We will name the table yellowtrips_3mo, as we will only store 3 months' worth of data:
create table yellowtrips_3mo (
    vendorid varchar(10),
    tpep_pickup_datetime datetime,
    tpep_dropoff_datetime datetime,
    passenger_count int,
    trip_distance float,
    ratecodeid varchar(10),
    store_and_fwd_flag char(1),
    pulocationid varchar(10),
    dolocationid varchar(10),
    payment_type varchar(10),
    fare_amount float,
    extra float,
    mta_tax float,
    tip_amount float,
    tolls_amount float,
    improvement_surcharge float,
    total_amount float,
    congestion_surcharge float);
Then, let's load the last 3 months of data into the table using the COPY command, as follows (replace the bucket name and role ARN with the values for your environment):
copy yellowtrips_3mo from 's3://data-lake-afo59dkg84/yellowtrips/yellow_tripdata_2020-10.csv' iam_role 'arn:aws:iam::xxxxxxxxxxxx:role/RedshiftSpectrumRole' FORMAT AS CSV dateformat 'auto' ignoreheader 1;
copy yellowtrips_3mo from 's3://data-lake-afo59dkg84/yellowtrips/yellow_tripdata_2020-11.csv' iam_role 'arn:aws:iam::xxxxxxxxxxxx:role/RedshiftSpectrumRole' FORMAT AS CSV dateformat 'auto' ignoreheader 1;
copy yellowtrips_3mo from 's3://data-lake-afo59dkg84/yellowtrips/yellow_tripdata_2020-12.csv' iam_role 'arn:aws:iam::xxxxxxxxxxxx:role/RedshiftSpectrumRole' FORMAT AS CSV dateformat 'auto' ignoreheader 1;
To query the data lake directly from Redshift, let's create an external schema that points to our Glue database using Redshift Spectrum:
create external schema spectrum_schema
from data catalog
database 'my-data-lake-db'
iam_role 'arn:aws:iam::xxxxxxxxxxxx:role/RedshiftSpectrumRole'
create external database if not exists;
Now, let's compare the two datasets with a simple count(*) query against each:
select count(*) from public.yellowtrips_3mo;
select count(*) from spectrum_schema.yellowtrips;
The first query runs against the recent data stored in the data warehouse. The second runs against the data in the data lake, using the Spectrum nodes. As expected, the number of records in the data lake should be much higher than the number of records in the data warehouse. Specifically, the queries returned 24,648,499 records for the year 2020 and 4,652,013 records for the last 3 months of 2020.
Note
The Spectrum queries use the Spectrum nodes and are charged separately from the Redshift cluster. Every query incurs an added cost based on the data it needs to scan. Refer to the AWS pricing for details.
Feel free to experiment with the data and trigger a few queries to understand the dataset. When you finish with the Redshift cluster, you can pause the cluster so that you stop the on-demand billing. Once the cluster is paused you will only pay for the cluster storage.
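For reference, pausing and resuming the cluster can also be done from the CLI, and once you are completely finished with the exercise you can delete the tutorial resources to stop all charges. A sketch of these commands:
% aws redshift pause-cluster --cluster-identifier mycluster --region us-east-1
% aws redshift resume-cluster --cluster-identifier mycluster --region us-east-1
# full cleanup once you no longer need the tutorial resources
% aws redshift delete-cluster --cluster-identifier mycluster --skip-final-cluster-snapshot --region us-east-1
% aws glue delete-crawler --name qs-book-crawler --region us-east-1
% aws glue delete-database --name my-data-lake-db --region us-east-1
% aws s3 rb s3://data-lake-xxxxxxxxxx --force
Remember to also detach and delete the IAM roles and the DataLakeReadAccess policy if you no longer need them.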