Pitfalls in Data Collection and Moving
I have found that many companies run into similar pitfalls when collecting and moving data. Although a company might generate and store valuable data, there is often no accessible record of what that data contains or where it is stored. For example, I worked at a company where I had to visit the department that generated the data to find out how to interpret it, and then a second department to find out where it was stored. This was time-consuming and an inefficient way to organize data storage.
Many companies also fail to use their data optimally when making decisions. Within companies there is often a lack of expertise in using data for decision-making; data may be interpreted incorrectly or left out of business decisions entirely. For example, a supermarket might order stock or produce on gut feeling rather than by analyzing product sales figures. Moreover, many companies enter their data into their systems manually and only clean it up periodically when it is needed. This passive approach is time-consuming, confusing for the employees tasked with the data, and error-prone. Think of manually importing weekly sales figures and then manually editing them to create visualizations in Excel.
I hope to share a method that helps to avoid some of these problems. If during the course of reading you find that you’ve already mastered the concepts being covered, feel free to continue on to the next section.
Background: Data Loading
Before we dive into the solution, let’s take a quick look at the specific steps of data loading. In general, the process can be divided into three steps: Extract, Load and Transform (ELT). Extract is the process of exporting the target data from its source. Load is the process of storing the data from the previous step in your own system. Transform is the final step, which cleans up this data and converts it into a form that is usable for the intended processes. Throughout the rest of this article, this process is abbreviated as ELT. If it is already clear to you exactly what this process entails, feel free to move on to the next section. For those of you who would like to dive into more detail, please read on.
The first step, Extract, retrieves the data you want to use from the source. This often means that the API of that service is called periodically to retrieve the data. The data usually arrives semi-structured or unstructured and is not yet tailored to the intended analysis. Structured data is organized according to a predefined schema or table, like data in an Excel sheet. Semi-structured data follows a few standard formatting rules, but within those rules the data can be stored inconsistently, such as data in a JSON file. Unstructured data is stored without any standard schema, such as a social media post (containing text, pictures, links, etc.).
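To make the Extract step concrete, here is a minimal Python sketch that pulls raw JSON from a sales API. The URL and API key are placeholders for whatever source you actually use, not a real service.

```python
import requests  # third-party HTTP client (pip install requests)

# Placeholders; substitute the real source service and credentials.
API_URL = "https://api.example.com/v1/sales"
API_KEY = "YOUR_API_KEY"

def extract_sales(day: str) -> list[dict]:
    """Call the source API and return the raw, untransformed records."""
    response = requests.get(
        API_URL,
        params={"date": day},
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    response.raise_for_status()  # fail loudly if the source is unavailable
    return response.json()       # semi-structured JSON, exactly as delivered
```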
Storage: Using a Data Lake
The second step, Load, is the process where the data from the Extract step is stored without any transformations. We call this storage a Data Lake. Large amounts of data are stored in a Data Lake in a flat structure, much like the file system on your own PC (Explorer/Finder), but with more freedom to choose the storage location. This can be a cloud service or an internal server; remember to give the data a unique ID so that it can be found again.
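As a sketch of the Load step, the snippet below writes the extracted records to a Data Lake untouched. A local folder stands in here for a cloud bucket or internal server, and the folder layout and the source label are just one possible convention.

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

DATA_LAKE = Path("data_lake")  # stands in for a cloud bucket or internal server

def load_raw(records: list[dict], source: str) -> Path:
    """Store the untransformed records under a unique, findable ID."""
    run_id = uuid.uuid4().hex                       # unique ID for this load
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    target = DATA_LAKE / source / day / f"{run_id}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(records))          # raw data, no transformations
    return target

# e.g. load_raw(extract_sales("2023-01-31"), source="sales_api")
```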
The final step, Transform, is the process where the stored data is loaded from the Data Lake and transformations are applied to make it usable for analysis. This step is somewhat separate from the Extract and Load steps: those are usually performed once per dataset and then periodically to pull in new data, whereas the Transform step only takes place when it is needed. It is therefore quite possible for the same data to run through several Transform processes, or for several datasets from the Data Lake to run through one single Transform step.
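Here is one possible Transform step, sketched with pandas: it reads a raw file from the Data Lake, cleans it and aggregates it for a specific analysis. The column names (product, price, quantity) are assumptions about the raw records, not a fixed schema.

```python
import json
from pathlib import Path

import pandas as pd  # third-party (pip install pandas)

def transform_daily_revenue(raw_file: Path) -> pd.DataFrame:
    """Clean the raw records and aggregate revenue per product."""
    records = json.loads(raw_file.read_text())
    df = pd.DataFrame(records)
    df = df.dropna(subset=["product", "price", "quantity"])  # basic clean-up
    df["revenue"] = df["price"] * df["quantity"]
    return df.groupby("product", as_index=False)["revenue"].sum()

# The same raw file can feed other Transform steps as well,
# e.g. one per report or analysis that needs it.
```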
Encouraging a Data-Driven Culture
Setting up a data-driven culture within the company ensures that the Data Pipeline will be used in the best possible way. This entails the following:
First, all decisions within the company are made based on insights generated from the available data. In a non-data-driven culture, decisions are often made based on inherited company culture and standard operating procedures. The disadvantage is that emerging trends in the market are missed and potential gaps or opportunities are not spotted in time to seize them. By making decisions based on data, you can observe emerging patterns and capitalize on them.
Second, all business processes should be fully documented. This makes it easy to retrace exactly why and how certain decisions were made. Moreover, it is possible to re-generate the data and the related insights. This reduces the black box problem, in which the source of decisions or insights eventually becomes unclear.
And, finally, the company’s own customers also get access to data-driven insights about their own profiles. This gives customers the tools to make certain data-based decisions themselves.
ELT versus ETL and Data Pipelines
The data loading method we discussed in the background is a good building block for a data-driven culture. Before a company can become data-driven, its data needs to be accessible and useful to its employees. There are various methods of loading data within a company, but the main ones are ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform). ETL differs from ELT in that the data is first transformed and then stored in a predefined format. Remember that in ELT the data is first stored in its rawest form.
Is One Method Really Preferable?
The advantage ETL offers over ELT is that there are fewer strange values in the data and the data is easier to interpret. ETL’s biggest pitfall is that it takes substantial time before the data can be stored in the system, since every transformation has to be applied first. Moreover, additional data may later be needed from a source whose connection has since broken or for which no history of previously added data was kept, in which case the raw data can no longer be recovered.
The advantage ELT has over ETL is that the data is stored in the system in its rawest form. This gives employees more freedom to create their own datasets from the Data Lake and apply their own transformations to them. The biggest pitfall of this method is that you can lose the overview of the data and of the Transform processes applied to it.
The best method depends on the situation. Although ELT may be preferred in most contexts, some projects may be better served by ETL, or by a combination of both. It is therefore better to build a Data Pipeline: choose the most suitable steps per data source, because there is no single universal method for all data sources.
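As an illustration, the toy sketch below mixes both orders in one pipeline: one source is stored raw first (ELT), another is cleaned before storage (ETL). The data and the in-memory “stores” are placeholders for real sources and storage systems.

```python
from typing import Any

def extract_api() -> list[dict[str, Any]]:
    return [{"product": "apple", "price": 0.5, "qty": 10}]  # stands in for an API call

def extract_csv() -> list[dict[str, Any]]:
    return [{"product": "bread", "price": None, "qty": 3}]  # stands in for a CSV read

def transform(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    return [r for r in rows if r["price"] is not None]      # drop incomplete rows

raw_store: dict[str, list] = {}    # stands in for the Data Lake
clean_store: dict[str, list] = {}  # stands in for a structured database

def run_pipeline() -> None:
    # Source 1, ELT: store raw first, transform later when needed.
    raw_store["sales_api"] = extract_api()
    # Source 2, ETL: transform before storing, so only clean rows are kept.
    clean_store["stock"] = transform(extract_csv())

run_pipeline()
```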
Data Pipeline Tools
Let’s go into a little more detail about Data Pipelines and look at the tools for developing a useful one. If you don’t need the details of a Data Pipeline, this is a good point to skip ahead to the conclusion; we don’t mind! If you’re still here, we’ll now walk through the tools we suggest as good choices, along with the concepts and motivations behind them.
We use Python, SQL and DBT to build the Data Pipeline. DBT stands for Data Build Tool and allows you to perform data transformations simply by writing SELECT statements. It’s a sort of SQL simplification. SQL and DBT are preferred when the necessary data is already in a cloud database because SQL performs operations directly on the database. In fact, the database engine can optimize SQL queries very well, which can lead to a more efficient operation than a Python implementation. Moreover, SQL queries are more readable than Python code for most client company employees.
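To show what letting the database do the work looks like from Python, here is a small sketch that runs a SQL transformation directly on a database. sqlite3 stands in for a cloud data warehouse, and the table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for a cloud data warehouse
conn.executescript("""
    CREATE TABLE sales (product TEXT, price REAL, quantity INTEGER);
    INSERT INTO sales VALUES ('apple', 0.5, 10), ('bread', 2.0, 3);
""")

# The transformation runs inside the database, where the engine can optimize it.
query = """
    SELECT product, SUM(price * quantity) AS revenue
    FROM sales
    GROUP BY product
"""
for product, revenue in conn.execute(query):
    print(product, revenue)
conn.close()
```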
The advantage DBT has over SQL is that workers only need to write SELECT statements, which are then translated by DBT into full SQL statements. Thus, employees can achieve the same optimization results as working in SQL without advanced SQL knowledge.
Python is preferred in cases where the data needs to be queried via an API and is not stored in a structured way. Python can also be used to load the data into a cloud database and then perform further transformations with SQL or DBT. Also consider using Python to create visualizations of the data and to perform more advanced Machine Learning steps, which aim to extract insights from the data.
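As a sketch of that hand-over, the snippet below uses Python and pandas to push a handful of records into a database table, after which SQL or DBT can take over. sqlite3 and the example records are placeholders for a real cloud database and real API output.

```python
import sqlite3

import pandas as pd

records = [  # placeholder for data retrieved from an API
    {"product": "apple", "price": 0.5, "quantity": 10},
    {"product": "bread", "price": 2.0, "quantity": 3},
]

conn = sqlite3.connect("warehouse.db")  # stands in for a cloud database
pd.DataFrame(records).to_sql("sales", conn, if_exists="append", index=False)
conn.close()
# From here, further transformations can be written in SQL or as DBT models.
```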
Data Versioning
Data versioning is a great way to keep track of changes in the data. It has the added advantage that you can travel back to an earlier version of the data. This is a good option when errors have crept into the data or when certain data points have been removed from the system. Past transformation steps can also be reproduced against a previous version of the data, which helps validate certain generated insights. DVC is a good implementation of data versioning: it is an open-source tool that offers many integration options with other storage services, such as cloud databases, cloud storage and on-premises data storage.
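As a rough sketch of what traveling back can look like with DVC’s Python API: the file path, repository URL and tag below are hypothetical, and the example assumes the file is already tracked by DVC in a Git repository.

```python
import dvc.api  # DVC's Python API (pip install dvc)

# Read the contents of an earlier, DVC-tracked version of a dataset.
old_csv = dvc.api.read(
    "data/sales.csv",                               # DVC-tracked file (hypothetical)
    repo="https://github.com/your-org/your-repo",   # hypothetical repository
    rev="v1.0",                                     # Git tag/commit of the old version
)
print(old_csv[:200])
```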
We use another tool, a Scheduler, to run the Data Pipeline periodically. The advantage of running the Data Pipeline on a given schedule is that we no longer have to start it manually, which saves time. We only need to intervene if an error occurs that the scheduler cannot resolve itself; beyond that, no manual actions are required. The scheduler also offers a monitoring structure: we can monitor the status of all data pipelines from one central tool and quickly intervene if necessary. Apache Airflow and Prefect are good Scheduler implementations. Apache Airflow offers more features than Prefect, but Prefect is easier to implement and produces more readable code.
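As an example of what a scheduled pipeline can look like, here is a minimal sketch with Prefect (2.x). The flow wraps placeholder extract and load tasks; in practice you would deploy it with a schedule, such as a daily cron, so the scheduler triggers it for you.

```python
from prefect import flow, task  # pip install prefect

@task
def extract() -> list[dict]:
    return [{"product": "apple", "price": 0.5, "quantity": 10}]  # placeholder source

@task
def load(records: list[dict]) -> None:
    print(f"storing {len(records)} raw records")                 # placeholder sink

@flow(name="daily-sales-pipeline")
def daily_sales_pipeline() -> None:
    load(extract())

if __name__ == "__main__":
    # Run once locally; a deployment adds the actual schedule and monitoring.
    daily_sales_pipeline()
```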
In Conclusion
We introduced the concept of Data Pipelines to address the issues that can arise around storing and using data. Most of the steps involved in loading and using data can be divided into three sections: Extract, where the required data is pulled in; Load, where the required data is loaded into your own data storage system; and Transform, where the data is transformed for analysis purposes.
Introducing a Data-Driven Culture within the company ensures that the available data is used optimally. Decisions within the company should be based on data, and this data should also be offered to the customer as a service. The optimal way of loading and using data depends on the data you use and the goal you have in mind. All the steps involved together are known as a Data Pipeline.
We also discussed the tools for implementing a Data Pipeline. In the next part of this article, we will perform a Data Pipeline implementation: we will provide a roadmap and implement it as working Python code. We hope you enjoy loading data in an automated manner!