Building a Simple Batch Data Pipeline from AWS RDS to Google BigQuery — Part 1: Setting Up AWS Data Pipeline

Muhammad Izzuddin
TheLorry Data, Tech & Product
8 min read · Jan 1, 2021


A couple of months ago, I was assigned to set up a batch data pipeline using the AWS Data Pipeline service to bring our operational (production) data from AWS RDS (our production DB) into Google BigQuery (our data warehouse) on a daily basis. The goal was to enable faster and more efficient dashboard queries in Holistics (an awesome BI tool).

I have divided this guide into two sections: the first covers the necessary configuration inside AWS, and the second covers the configuration required inside GCP. Almost all the steps explained here deal with UI configuration of each service, with minimal coding required, so if you are a beginner at data engineering (as I am), I hope this guide will be a kickstart for your journey as well.

Our approach will be based on the ELT paradigm. Wait! What is ELT? In any organization's analytics workflow, the most intensive step usually lies in data preparation: that is, combining, cleaning, and creating data sets that are ready for business consumption and decision making. This function is commonly known as ETL (Extract, Transform, and Load), which identifies the three distinct stages involved. In this process, an ETL tool extracts the data from different source systems, transforms the data by applying calculations, concatenations, and the like, and finally loads the data into the data warehouse.

ELT is a different way of looking at this problem. Instead of transforming the data before it is loaded into the database, ELT does the transformation within the data warehouse. Your data will be loaded into the data warehouse first, and then transformed in place.

ELT Process (Source: The Analytics Setup Guidebook by Holistics.io)

The key point to note in the ELT process is that raw data is transformed inside the data warehouse without the need for a staging server; your data warehouse now contains both raw data and transformed data. We prefer our pipelines this way at TheLorry. The Analytics Setup Guidebook is a great resource if you're interested in a deep dive into these data analytics/engineering core concepts, among many others.
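To make the "transform in place" idea concrete, here is a minimal sketch, assuming a hypothetical raw orders table that has already been loaded into BigQuery (the dataset, table, and column names are invented for illustration):

    -- Hypothetical ELT-style transformation: the raw table stays untouched,
    -- and a cleaned table is derived from it entirely inside the warehouse.
    CREATE OR REPLACE TABLE analytics.orders_clean AS
    SELECT
      order_id,
      LOWER(TRIM(customer_email)) AS customer_email,   -- basic cleaning
      CAST(order_total AS NUMERIC) AS order_total,
      created_at
    FROM raw.orders;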

Okay, enough theory, let's get back to our pipeline. Following the ELT strategy, we first Extract the data from AWS RDS, Load it into BigQuery, and then Transform it inside BigQuery. The main reason we chose the ELT paradigm is that our transformations are quite simple, and the learning curve to perform these transformations in SQL is much lower compared to other tools (like AWS Glue or Google Dataflow).

(The reason we create a single final denormalized table beforehand is that, as in most cases, it is much cheaper and faster to read from a single denormalized table than to query multiple joined tables.)
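For illustration only (again with invented table and column names), the kind of denormalization we schedule in BigQuery boils down to a single join that materializes one wide table:

    -- Hypothetical daily denormalization: pre-join once so dashboards read one table.
    CREATE OR REPLACE TABLE analytics.orders_denormalized AS
    SELECT
      o.order_id,
      o.created_at,
      c.customer_name,
      d.driver_name
    FROM raw.orders AS o
    LEFT JOIN raw.customers AS c ON c.customer_id = o.customer_id
    LEFT JOIN raw.drivers   AS d ON d.driver_id   = o.driver_id;

A BI tool can then point at this single table instead of re-running the joins for every dashboard query.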

Here is the overall picture regarding the pipeline and cloud services involved in the process.

Data pipeline flow from AWS RDS into Google BigQuery

We use Amazon Relational Database Service (Amazon RDS) for the majority of our projects. It makes it easy to set up, operate, and scale a relational database in the cloud. It provides cost-efficient and resizable capacity while automating time-consuming administration tasks such as hardware provisioning, database setup, patching, and backups. You can get started with their free tier plan too.

Based on the above diagram, here are the steps:

  1. Extract data from the MySQL RDS instance and bring it into S3 using the 'AWS Data Pipeline' service (this service will create a new file, based on the date field you choose, inside the designated table's folder).
  2. From S3, bring the file into BigQuery using the 'BigQuery Transfer Service' (all the daily files for each table will be appended to the target table inside BigQuery).
  3. On a daily basis, run transformations on these tables and join them using BigQuery's 'Scheduled Query' service to produce a finalized denormalized table.
  4. The denormalized table can then be connected to a third-party BI dashboard or used for ad-hoc queries.

For our first post, we will be covering the necessary configuration inside AWS using the AWS 'Data Pipeline' service.

Click on 'Create new pipeline'.

You can choose either to build the pipeline from scratch or to 'Build using a template'. Building from a template provides us with the basic pipeline building blocks and the configuration required to generate the pipeline.

Since we want to extract new rows of data based on their created datetime, we will choose 'Incremental copy of RDS MySQL table to S3'.

In the Parameters section, fill in the field values:

  • Last modified column name — the pipeline will extract the incremental copy of the table by referring to this column's timestamps.
  • RDS MySQL password
  • RDS MySQL table name — insert the name of one of the tables that you want to extract.
  • Output S3 folder — the output location for the table inside S3.
  • RDS MySQL username
  • EC2 instance type: the EC2 compute resource that this pipeline's activities will run on.
  • RDS instance ID
Pipeline Creation Console

In the Schedule section, fill in the extraction schedule that you want to run against your RDS table. If you only want to extract a single table, you can simply click 'Activate'.

Since we wanted to schedule the extraction of multiple tables from our database, we further edited the pipeline template by clicking 'Edit Pipeline'.

Edit Pipeline Console

Here is the pipeline generated from our template, based on the parameters that we chose. In order to add scheduled extractions of multiple tables inside RDS, click the 'Add' button at the top left.

Since our database is MySQL on RDS, click 'SqlDataNode'.

A 'DefaultSqlDataNode' will appear under the Schedule node. We need to bring this SqlDataNode under the RdsDatabase node (as that node is what extracts data from our RDS database). The AWS Data Pipeline console does not allow us to simply drag and drop the node; we have to configure the link between nodes through the node settings panel.

Click 'DefaultSqlDataNode1' and, in the settings panel on the right, rename it to 'table2'. Fill in the 'Table' field with the name of the table that we want to extract from our RDS database. Click 'Add an optional field' and choose 'Database'.

Now, in order to bring our SqlDataNode under the rds_mysql node, click the newly added 'Database' field and choose 'rds_mysql'.

Our 'table2' node will now appear below the rds_mysql node.

Next, we need to put a query inside our 'table2' node. To do so, again click 'Add an optional field' and choose 'Select Query'. You can either write your own query or simply copy the 'Select Query' statement from 'SourceRDSTable' (which was automagically generated from the template we chose). This works in my case because the different tables share the same last-modified column name.

Basically, #{} refers to a parameter that we declared: #{table} refers to the 'Table' field that we set inside our 'table2' data node. The start and end datetimes of the query's extraction window come from '#{format(@scheduledStartTime, 'YYYY-MM-dd HH-mm-ss')}' and '#{format(@scheduledEndTime, 'YYYY-MM-dd HH-mm-ss')}'. The @scheduledStartTime and @scheduledEndTime values are derived from the 'Schedule' that we set from our template.
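For reference, the template-generated query is roughly equivalent to the sketch below. Treat it as illustrative: #{myRDSTableLastModifiedCol} is what the template called the last modified column parameter in my setup, and yours may be named differently.

    -- Incremental extraction query (sketch). #{table} and #{myRDSTableLastModifiedCol}
    -- are pipeline parameters; the schedule variables come from the Schedule node.
    select * from #{table}
    where #{myRDSTableLastModifiedCol} >= '#{format(@scheduledStartTime, 'YYYY-MM-dd HH-mm-ss')}'
      and #{myRDSTableLastModifiedCol} <= '#{format(@scheduledEndTime, 'YYYY-MM-dd HH-mm-ss')}'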

After we are done configuring our 'table2' SqlDataNode, we need to attach a 'CopyActivity' below our table2 node.

Again, our new 'CopyActivity' node will appear under the 'Schedule' node. To bring the 'CopyActivity' under our table2 node, click the 'CopyActivity' node, then click the 'Input' field and choose 'table2'.

This will bring our newly created ‘CopyActivity’ node under our table2 node.

Rename the 'CopyActivity' node to 'copy_table2'. For any 'CopyActivity', we need to allocate compute resources to the node, so choose the 'Runs On' field on our copy_table2 node.

In the 'Runs On' field, select our EC2 instance node.

The copy_table2 node will now run on our Ec2Instance.

Next, we need to assign an output S3 location for our table2. Click on our 'copy_table2' node again and choose 'Create new: DataNode'.

Click our new data node. Under 'Type', choose 'S3DataNode'. We then need to declare the file path for our output. In our case, the file name written inside the S3 table2 folder will include a 'YYYY-MM-dd-HH-mm-ss' timestamp.
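As a purely hypothetical example (the bucket name and folder layout below are made up), the S3DataNode's file path can use the same pipeline expression syntax to stamp each extract with its scheduled run time:

    s3://your-output-bucket/table2/table2-#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}.csv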

Basically, that is all you have to do to extract a single table from RDS to S3 on a daily basis. If you want to add a new copy activity for a new table, just repeat the steps described for 'table2'.

Click 'Save' and then 'Activate' the pipeline. Below is my Data Pipeline dashboard. It shows a summary of our pipeline executions ('Schedule Interval', 'Type', the status of each activity inside the pipeline, 'Execution Start', 'Execution End' and 'Attempts'). By default, each activity will run at least 3 times if it fails.

To diagnose a failed attempt for an activity, just click the toggle button; based on the error message, you can make any necessary adjustments and rerun the component again.

Congratulations, we are done with setting up the AWS Data Pipeline service. In the next article, I will take you through the remaining steps to complete our ELT workflow.
