As data scientists, we need to turn data into business insight that has an impact on the company. ETL is the first step: it breaks down data silos and makes the data easy to analyze.
In this article, I explain ETL using Luigi, a simple tool for running the tasks in a data pipeline. By the end, you will know:
1. What ETL is and what benefits it brings
2. What Luigi is and how to set it up
3. How to build an ETL pipeline in Luigi, hands-on
The Almighty ETL
In an organization, data has to be collected, transformed, and migrated or loaded somewhere before stakeholders can use it, and each of these steps needs to be monitored. For instance, data analysts draw insights from the data, while data scientists run statistical analyses and build machine learning models. This process is known as ETL, which stands for Extract, Transform, and Load.
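To make the three stages concrete, here is a minimal sketch in plain Python, before any ETL tool is involved. The column names (customer, amount), the sample rows, and the in-memory "warehouse" are all made up for illustration.

```python
import csv
import io


def extract(raw_csv):
    """Extract: parse raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def transform(rows):
    """Transform: trim whitespace, cast amounts to int, drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append({"customer": row["customer"].strip(), "amount": int(amount)})
    return cleaned


def load(rows, destination):
    """Load: write the cleaned rows to a file-like destination."""
    writer = csv.DictWriter(destination, fieldnames=["customer", "amount"])
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical raw data with one messy name and one missing amount.
raw = "customer,amount\n alice ,10\nbob,\ncarol,25\n"
warehouse = io.StringIO()  # stands in for a real database or file
load(transform(extract(raw)), warehouse)
print(warehouse.getvalue())
```

Each stage is a separate function, so one stage can be rerun or replaced without touching the others. ETL tools build on the same separation.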
An organization may choose not to use an ETL tool at all. However, data arrives continuously and needs maintenance, and using an ETL tool brings benefits such as:
- Enhanced business intelligence
- Timely access to data
- Better data quality and consistency
- High return on investment (ROI)
Now you know why organizations need an ETL process for the data they generate. Next, I introduce one such ETL tool, Luigi, which was created at Spotify.
What is Luigi?
Luigi is a Python (3.6, 3.7 tested) package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.
Luigi has several basic components and terms:
- Task is the basic data processing step in a pipeline. For instance, a task can implement order processing or report generation.
- Target is what a task produces once it has finished, and it is the second building block of a pipeline. For instance, a report generation task creates a CSV file with the actual report.
- Requires lists all the task instances that must be completed before the current task runs.
- Output declares where the task stores its result. It returns one or more Target objects.
- Run contains the logic that the task executes to do its part of the ETL.
To get started with Luigi, you need:
- A Python IDE (I will use PyCharm Community Edition)
In a terminal, check your Python version with python --version; it should be Python 3. Once you know the version, create a Python virtual environment and install Luigi with these steps:
- Create the environment with python3 -m venv ~/Documents/demoday (~/Documents/demoday is my folder path)
- Activate the virtual environment with source ~/Documents/demoday/bin/activate
- Install Luigi with pip install luigi
- Check the installed packages with pip list; luigi should appear in the output
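If you prefer to verify the setup from Python itself, a small check along these lines can help. It is only a sketch: it assumes Python 3.8 or newer (for importlib.metadata), and the helper names are my own.

```python
import sys
from importlib import metadata


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style environment."""
    return sys.prefix != sys.base_prefix


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("Python:", sys.version.split()[0])
print("Virtual environment active:", in_virtualenv())
print("Luigi version:", installed_version("luigi") or "not installed")
```

Run it with the virtual environment activated. If Luigi is reported as not installed, pip probably installed it into a different interpreter.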
In PyCharm, set up a new project with the following steps:
1. Create a new project
2. Click Project Interpreter, then click the triple dot next to Existing Interpreter
3. In the interpreter window, select the Python executable inside your virtual environment folder and click Create.
Hands-on ETL using Luigi
Everything has been set up and you are ready to create your ETL in Luigi.
1. Practice: creating a sales report
First, you will create a CSV file that contains the month and the sales amount of each order and save it as orders.csv. Next, you will transform the data by calculating the total amount for each month. Finally, you will load the result into a CSV file called report.csv.
To run the code, you can run it directly in PyCharm or from your terminal by typing python [your python file]; in my terminal, I type python sales_report.py. The resulting report.csv contains May with 280 sales and June with 410 sales.
2. Practice: compiling existing files into one file
This ETL code extracts several CSV files and compiles them into one CSV file. This practice is harder than the previous one, but if you are comfortable with the first practice, it should not be a problem.
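You can also view your ETL pipeline as a graph: type luigid in your terminal, open http://localhost:8082/ in your browser, click View Graph in the Actions column, and choose D3 as the visualization type. Tasks appear there only when they are run against this central scheduler rather than with a local scheduler.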
Congratulations! You can now create an ETL process using Luigi. This article is only an introduction to basic ETL, but you now know the fundamentals of building ETL pipelines in Luigi. I have also put together intermediate code that builds an ETL pipeline and loads the data into different databases. If you are interested in the code, you can find it on my GitHub here.