
An important aspect of ETL applications is managing the loading of data. This recipe introduces you to managing incoming data by creating a stage and querying that data for loading into native Snowflake tables. The process is quite different from traditional data warehouses as it mixes in concepts from modern big data systems. Details around ETL will not be covered here; they are deferred to later chapters, which explain how an ETL pipeline can be managed.
The following example requires SnowSQL to run the different steps. Apart from that, you will need to have access to an AWS S3 bucket where data can be placed and made available as files.
The following steps start with the creation of a stage, which is used to temporarily store data before it can be copied into Snowflake:
1. Let's start by creating a stage called sfuser_ext_stage. The stage should be accessible to Snowflake:
CREATE OR REPLACE STAGE sfuser_ext_stage URL='s3://snowflake-cookbook/Chapter02/r4/';
The response should say that a stage has been created successfully, as shown:
Figure 2.21 – Stage area created
2. Now list the contents of the SFUSER_EXT_STAGE stage pointing to the snowflake-cookbook S3 bucket:
LIST @SFUSER_EXT_STAGE;
This statement should generate the following output in the web UI:
Figure 2.22 – Listing the stage to the S3 bucket
We can see that there are two types of files in the preceding listing: csv and parquet. In the case of csv, the electronic-card-transactions-may-2020-headless.csv file is a header-less version of the electronic-card-transactions-may-2020.csv file. There is a parquet format file as well, called userdata1.parquet. We shall create external tables on both files. An external table is different from a usual database table: rather than pointing to data stored inside the database, an external table provides a view on top of files stored in a stage. These are read-only tables that maintain metadata that is helpful in interpreting the contents of a file, which could be formatted as parquet, csv, and so on.
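As a side note, an external table exposes each file row through a VALUE variant column and also provides a METADATA$FILENAME pseudo-column that records which staged file a row came from. A minimal sketch of querying this metadata, using the ext_tbl_userdata1 table created in the next step:
select metadata$filename, value from ext_tbl_userdata1 limit 10;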
3. Let's now see how the parquet file can be loaded into an external table. We shall be creating an external table called ext_tbl_userdata1. The creation of the table requires a location from which data can be read into the table. It also requires a file format; in this case, the file type is parquet:
create or replace external table ext_tbl_userdata1 with location = @sfuser_ext_stage file_format = (type = parquet);
4. Query the newly created external table to view its contents:
select * from ext_tbl_userdata1;
The following screenshot shows only some of the key-value pairs due to size constraints:
Figure 2.23 – Output of the query showing key-value pairs
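Since the Parquet file carries its field names, individual fields can also be extracted from the VALUE column by name and cast to regular types. The field names used below (first_name and country) are assumptions for illustration only; substitute the keys that appear in your own output:
select value:first_name::string as first_name, value:country::string as country from ext_tbl_userdata1 limit 5;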
5. Similarly, a second external table called ext_card_data is created that has its location pointing to the stage. In this case, the file is located in a …/ch2/r4/csv subfolder, which gets us to the folder where the file is located. file_format, in this case, provides the information that the file is a CSV, and finally, a file filter is provided to constrain the search to CSVs with headless in their names:
create or replace external table ext_card_data with location = @sfuser_ext_stage/csv file_format = (type = csv) pattern = '.*headless[.]csv';
6. Now query the ext_card_data table:
select * from ext_card_data;
The following screenshot of the resulting records shows some of the rows generated by the query. Please note the difference here: there are no meaningful column names in this case, unlike the previous case of the Parquet file:
Figure 2.24 – Output of the select statement
7. The following query selects two columns from the external table, c3 and c2, and creates the card_sum and period aliases, respectively. Moreover, to use the columns effectively, casting has been carried out for each column value:
select top 5 value:c3::float as card_sum, value:c2::string as period from ext_card_data;
The result of the query is shown:
Figure 2.25 – Output of the query selecting two columns
8. Finally, drop both external tables to clean up:
drop table ext_card_data; drop table ext_tbl_userdata1;
The stage created in step 1 can be thought of as a reference to a storage location. It is treated as a read-only location that can only be accessed using the appropriate access rights. The S3 bucket in step 1 is a public bucket and does not need credentials to be accessed. In later chapters, when we look at staging in more detail, we shall start delving into securing the staging locations.
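For comparison, a stage that points to a secured (non-public) bucket would typically supply credentials or a storage integration when it is created. The following is only a sketch with placeholder values; securing staging locations properly is the subject of later chapters:
CREATE OR REPLACE STAGE secured_ext_stage URL='s3://<your-bucket>/<path>/' CREDENTIALS=(AWS_KEY_ID='<aws key id>' AWS_SECRET_KEY='<aws secret key>');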
Step 2 dereferences the specified S3 bucket to list all the files that are available, along with their last modified dates.
In step 3, when the Parquet file is loaded into the external table, the table rows have all the field names captured, as shown in step 4. But in step 5, when the CSV-formatted file is loaded into the external table, there are dummy column names created by Snowflake, as can be seen in step 6. This is because a Parquet file has metadata stored inside the file, while a CSV file does not have that metadata embedded in it. This is a major difference and would usually require additional steps as shown in step 7 to generate meaningful column names, plus the casting of data types.
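One common way to handle this is to wrap the CSV external table in a view that names and casts its columns. The following is a minimal sketch with a hypothetical card_data_vw view, assuming the column mappings from step 7 hold; the remaining columns would be mapped in the same way:
create or replace view card_data_vw as
select
    value:c2::string as period,   -- assumed mapping, as in step 7
    value:c3::float as card_sum   -- assumed mapping, as in step 7
from ext_card_data;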
In this recipe, we did not look at how data could be loaded into a stage. It is possible to load data into a stage from a local filesystem. This method will be discussed in later chapters.
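As a quick preview, files can be uploaded from a local filesystem into an internal stage using the PUT command in SnowSQL (PUT does not work against external stages such as the S3-backed stage used above). The stage name and file path below are placeholders for illustration:
CREATE OR REPLACE STAGE my_internal_stage;
PUT file:///tmp/electronic-card-transactions-may-2020.csv @my_internal_stage;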
Looking at the metadata limitations of external tables, it can be argued that Avro may be the best format for staging data, as Avro files specify field names as well as data types. We shall look into an example in a later chapter where we discuss ETL processing.
Please note that the last modified dates for each file in a stage can be a useful mechanism to trigger updating data in an external table and can be used for running the ETL process.
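When new files arrive in the stage, the metadata of an external table can be refreshed so that subsequent queries see the new data. A minimal sketch of a manual refresh for the table created in step 5 (automatic refresh via cloud event notifications is also possible but is not covered here):
alter external table ext_card_data refresh;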