
Serverless Analytics with Amazon Athena
By :

Before we can start running our first query, we will need some data that we would like to analyze. Throughout this book, we will try to make use of open datasets that you can freely access but that also contain interesting information that may mirror your real-world datasets. In this chapter, we will be making use of the NYC Taxi & Limousine Commission's (TLC's) Trip Record Data for New York City's iconic yellow taxis. Yellow taxis have been recording and providing ride data to TLC since 2009. Yellow taxis are traditionally hailed by signaling to a driver who is on duty and seeking a passenger (also known as a street hail). In recent years, yellow taxis have also started to use their own ride-hailing apps such as Curb and Arro to keep pace with emerging ride-hailing technologies from Uber and Lyft. However, yellow taxis remain the only vehicles permitted to respond to street hails from passengers in NYC. For that reason, the dataset often has interesting patterns that can be correlated with other events in the city, such as a concert or inclement weather.
Our exercise will focus on just one of the many datasets offered by the TLC. The yellow taxis data includes the following fields:
VendorID
: A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.tpep_pickup_datetime
: The date and time when the meter was engaged.tpep_dropoff_datetime
: The date and time when the meter was disengaged.Passenger_count
: The number of passengers in the vehicle.Trip_distance
: The elapsed trip distance in miles reported by the taximeter.RateCodeID
: The final rate code in effect at the end of the trip. 1= Standard rate, 2= JFK, 3= Newark, 4= Nassau or Westchester, 5= Negotiated fare, 6= Group ride.Store_and_fwd_flag
: This flag indicates whether the trip record was held in the vehicle's memory before being sent to the vendor, also known as "store and forward," because the vehicle did not have a connection to the server. Y= store and forward trip, while N= not a store and forward trip.pulocationid
: Location where the meter was engaged.dolocationid
: Location where the meter was disengaged.Payment_type
: A numeric code signifying how the passenger paid for the trip. 1= Credit card, 2= Cash, 3= No charge, 4= Dispute, 5= Unknown, 6= Voided trip.Fare_amount
: The time-and-distance fare calculated by the meter.Extra
: Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.MTA_tax
: $0.50 MTA tax that is automatically triggered based on the metered rate in use.Improvement_surcharge
: $0.30 improvement surcharge assessed trips at the flag drop. The improvement surcharge began being levied in 2015.Tip_amount
: This field is automatically populated for credit card tips. Cash tips are not included.Tolls_amount
: Total amount of all tolls paid in a trip.Total_amount
: The total amount charged to passengers. Does not include cash tips.congestion_surcharge
: Amount surcharges associated with time/traffic fees imposed by the city.This dataset is easy to obtain and is relatively interesting to run analytics against. The inconsistency in field naming is difficult to overlook but we will normalize using a mixture of camel case and underscore conventions later:
wget https://github.com/PacktPublishing/Serverless-Analytics-with-Amazon-Athena/raw/main/chapter_1/yellow_tripdata_2020-06.csv.gz
If you choose to download it from the NYC TLC directly, please gzip the file before proceeding to the next step.
aws s3api create-bucket \ --bucket packt-serverless-analytics \ --region us-east-1
Important Note
Be sure to substitute your bucket name and region. You can also create buckets directly from the AWS Console by logging in and navigating to S3 from the service list. Later in this chapter, we will use the AWS Console to edit and run our Athena queries. For simple operations, using the AWS CLI can be faster and easier to see what is happening since the AWS Console can hide multi-step operations behind a single button.
tables/nyc_taxi
using the following AWS CLI command. Be sure to replace the bucket name, packt-serverless-analytics
, in the following example command with the name of your bucket:aws s3 cp ./yellow_tripdata_2020-06.csv.gz \ s3://packt-serverless-analytics/tables/nyc_taxi/yellow_tripdata_2020-06.csv.gz
This command may take a few moments to complete since it needs to upload our roughly 10 MB file over the internet to Amazon S3. If you get a permission error or message about access being denied, double-check you used the right bucket name.
aws s3 ls s3://packt-serverless-analytics/tables/nyc_taxi/
aws glue create-database \ --database-input "{\"Name\":\"packt_serverless_analytics\"}" \ --region us-east-1
Now that both our data and Metastore are ready, we can define our table right from Athena itself by running our first query.