
What's the Deal With Data Engineers Anyway?

by Aidar Gain, December 6th, 2024

Too Long; Didn't Read

This article explains the basics of data engineering through a practical ETL pipeline project. Learn how to extract data from APIs and web scraping, transform it into usable formats, and load it into a relational database for insights. A beginner-friendly guide with real-world applications in weather, flights, and city data.


Imagine you’re building a house. You have the structure — walls and a roof — but what’s the first thing you need to make it usable? Correct! Utilities — water, electricity, gas. Without them, the house is just an empty shell. Building and maintaining the plumbing to ensure everything flows smoothly is essential to making a house functional. This is the perfect analogy for the role of a data engineer.


I have some experience in data engineering through my years as a data analyst. Continuing the analogy, if data engineers are responsible for building and maintaining the utilities, data analysts are like the people who decide how to use them effectively — setting up smart thermostats, monitoring energy consumption, or planning renovations. Throughout my journey, I collaborated with data engineers and gained a high-level understanding of their responsibilities. Without their work, analysts like me would be stuck waiting for “water” (data) to arrive. Thanks to data engineers, I always had access to structured and up-to-date data.


Now, I’d like to take you through a basic learning project to dive into this domain a little deeper.


Disclaimer: In some organizations, especially smaller ones, the roles of data engineers and analysts may overlap.

The Problem

Imagine you’re a junior data engineer working for an e-scooter service company. One of your tasks is to collect data from external sources, such as weather and flight information. You might wonder, “Why is this data even necessary?” Here’s why:


  • When it rains, the demand for e-scooters drops significantly.
  • Tourists often prefer using e-scooters to explore cities.


By analyzing patterns like these, weather and flight data can help predict usage trends and optimize the distribution of e-scooters. For example, if rain is forecasted in one area, or a large wave of tourists is expected due to flight arrivals, the company can proactively manage scooter locations to meet changing demand.


Your goal is to ensure this data is well-organized, updated daily, and easily accessible from a database. Let's dive into the process of extracting, transforming, and loading this data.

Data Extraction

To extract external data (or “water”), there are two common approaches:

  • Pump from a river: Web scraping.
  • Connect to a central water supply: APIs.


In this project, I used both methods.

Web Scraping

Web scraping involves extracting data directly from web pages. I scraped Wikipedia for city information, such as population and coordinates, using Python libraries like requests and BeautifulSoup. Here’s how it works:


  • Requests: Fetches the HTML content of a web page.
  • BeautifulSoup: Parses the HTML to extract specific elements, like population or latitude.


HTML (Hypertext Markup Language) is the standard language for structuring web pages. Data like population is often located within HTML elements like <table> or <div> tags.
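
To make this concrete, here is a minimal sketch of the approach, not the project's exact scraper: it fetches a city's Wikipedia page and pulls the population figure out of the infobox. Infobox layouts vary between pages, so treat the selectors as illustrative.

import requests
from bs4 import BeautifulSoup

def scrape_city_population(city_name):
    """Fetch a city's Wikipedia page and read the population from the infobox table."""
    url = f"https://en.wikipedia.org/wiki/{city_name}"
    response = requests.get(url)
    response.raise_for_status()

    soup = BeautifulSoup(response.content, "html.parser")
    infobox = soup.find("table", class_="infobox")
    # Walk the infobox rows until we hit the one whose header mentions "Population".
    for row in infobox.find_all("tr"):
        header = row.find("th")
        if header and "Population" in header.get_text():
            value_cell = row.find_next("td")
            return value_cell.get_text(strip=True) if value_cell else None
    return None

print(scrape_city_population("Berlin"))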


While scraping can be useful, it’s often inefficient and unreliable:

  • Some websites use JavaScript to load content dynamically, making it harder to extract.
  • Others block scraping attempts using security mechanisms like CAPTCHA.

APIs

In the previous step, we retrieved the longitude and latitude for the cities where the company operates. This data is crucial for making requests to APIs, which are designed to provide structured and reliable data access. Let me explain:


An API (Application Programming Interface) is a tool that allows computers or programs to communicate with each other. For instance, I used the OpenWeather API to fetch real-time weather data and the Aerodatabox API for airport and flight details.


However, just like accessing a central water supply, using an API often comes with a cost. OpenWeather offers a limited number of free daily requests, while Aerodatabox requires a subscription for extended use. Additionally, APIs use an API Key — a unique code that authenticates requests to ensure that only authorized users can access the service.


Of course, many APIs can work directly with city names instead of requiring latitude and longitude. However, it’s valuable to explore multiple approaches and understand how geographic coordinates are used, isn’t it?


By making polite requests to the API with our location data, we receive structured responses, like weather forecasts or flight information, ready for analysis.
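
As an illustration, here is a stripped-down request to the OpenWeather 5-day forecast endpoint using the coordinates we scraped earlier. The API key is a placeholder, and the exact fields available depend on your OpenWeather plan.

import requests

API_KEY = "your_openweather_api_key"  # placeholder: every request must be authenticated

def get_weather_forecast(lat, lon):
    """Request the 5-day / 3-hour forecast for a pair of coordinates."""
    url = "https://api.openweathermap.org/data/2.5/forecast"
    params = {"lat": lat, "lon": lon, "appid": API_KEY, "units": "metric"}
    response = requests.get(url, params=params)
    response.raise_for_status()
    return response.json()

forecast = get_weather_forecast(52.52, 13.405)  # Berlin
for entry in forecast["list"][:3]:
    print(entry["dt_txt"], entry["main"]["temp"], entry["weather"][0]["description"])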


Responses like this are well-structured and convenient to extract data from.

Data Transformation

The next step in ETL is Transformation — cleaning and organizing raw data into a usable format.

Common tasks, sketched in the short example after this list, include:

  • Removing unnecessary columns.
  • Formatting data for consistency.
  • Handling missing values.
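
To show what these tasks look like in practice, here is a small pandas sketch with made-up data and column names, not the project's actual schema:

import pandas as pd

# Toy raw data standing in for whatever the scraper returned.
raw_df = pd.DataFrame({
    "city": ["Berlin", "Hamburg", "Munich"],
    "population": ["3,878,100", "1,906,411", None],
    "source_url": ["https://en.wikipedia.org/wiki/Berlin",
                   "https://en.wikipedia.org/wiki/Hamburg",
                   "https://en.wikipedia.org/wiki/Munich"],
})

clean_df = (
    raw_df
    .drop(columns=["source_url"])  # remove unnecessary columns
    .assign(population=lambda d: pd.to_numeric(
        d["population"].str.replace(",", "", regex=False)))  # format for consistency
    .dropna(subset=["population"])  # handle missing values
)
print(clean_df)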


The most interesting transformation in my project was converting latitude/longitude data from DMS format (e.g., 52°31′12″N) to decimal format (e.g., 52.52000). This made the data compatible with APIs. I wrote a Python function to handle this conversion. In simple terms, it’s like attaching a filter to your tap to get cleaner, better-tasting water.
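
My exact function isn't reproduced here, but a minimal version of the idea, assuming the coordinates arrive as strings in that DMS notation, could look like this:

import re

def dms_to_decimal(dms):
    """Convert a DMS coordinate string such as 52°31′12″N into decimal degrees."""
    match = re.match(r"(\d+)°(\d+)′(\d+(?:\.\d+)?)″([NSEW])", dms.strip())
    if not match:
        raise ValueError(f"Unrecognized DMS format: {dms}")
    degrees, minutes, seconds, hemisphere = match.groups()
    decimal = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    # Southern and western coordinates are negative in decimal notation.
    return -decimal if hemisphere in ("S", "W") else decimal

print(round(dms_to_decimal("52°31′12″N"), 5))  # 52.52
print(round(dms_to_decimal("13°24′18″E"), 5))  # 13.405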

Data Loading

The final step is Loading the cleaned data into a relational database for easy access and analysis.


Why not use Excel or CSV files?

Excel is like having a separate water container for every tap, washing machine, and shower — sufficient for small needs but disconnected and inefficient as your system grows. MySQL, on the other hand, is like a modern plumbing system that delivers water to every part of the house seamlessly and reliably. It’s scalable, efficient, and keeps everything linked together.


In this project, I used a MySQL database with tables for cities, population, weather, airports, and flights. After defining the schema and relationships between tables using SQL, I wrote Python functions to insert data using SQLAlchemy and pandas. Now, the database snapshot looks well-organized and ready for use.
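
The real schema and helper functions live in the repository, but the loading step boils down to something like the sketch below. The connection details and the weather columns are placeholders, and the function names simply mirror the ones used in the pipeline further down.

import pandas as pd
from sqlalchemy import create_engine

def create_connection_string():
    # Placeholder credentials: swap in your own MySQL host, user, password, and schema.
    return "mysql+pymysql://user:password@localhost:3306/escooter_db"

def push_data_to_db(df, table_name, connection_string):
    """Append a DataFrame to an existing MySQL table."""
    engine = create_engine(connection_string)
    df.to_sql(table_name, con=engine, if_exists="append", index=False)

weather_df = pd.DataFrame({
    "city_id": [1],
    "forecast_time": ["2024-12-06 12:00:00"],
    "temperature": [3.2],
    "outlook": ["light rain"],
})
push_data_to_db(weather_df, "weather", create_connection_string())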

Automation

So far, I’ve executed the pipeline manually using Python, collecting all the necessary data into the database. Under the hood, it is one sizable piece of code that runs sequentially, performing operations such as creating a database connection, retrieving data from each source, and pushing tables to the SQL database.


Key Pipeline Function:

def extracting_and_loading_data(list_of_cities):
    connection_string = create_connection_string()
    # Cities: scrape city information, clean it, and load unique rows.
    cities_df = get_cities_data(list_of_cities)
    cities_to_db = transform_cities_df(cities_df)
    push_unique_data_to_db(cities_to_db, 'cities', connection_string)
    cities_from_sql = extract_from_db('cities', connection_string)
    # Population: derive figures from the scraped data and link them to city ids.
    population_to_db = transform_population(cities_df, cities_from_sql)
    push_unique_data_to_db(population_to_db, 'population', connection_string)
    # Weather: request forecasts from the API for each stored city.
    weather_df = get_weather_data(cities_from_sql)
    push_data_to_db(weather_df, 'weather', connection_string)
    # Airports and flights: look up airports per city, then flights per airport.
    airports_df = get_airports_data(cities_from_sql)
    airports_to_db = transform_airports_df(airports_df)
    push_unique_data_to_db(airports_to_db, 'airports', connection_string)
    airports_from_sql = extract_from_db('airports', connection_string)
    flights_df = get_flights_data(airports_from_sql)
    push_data_to_db(flights_df, 'flights', connection_string)
    return "Data has been updated"

extracting_and_loading_data(["Berlin", "Hamburg", "Munich", "Cologne", "Stuttgart", "Leipzig", "Dortmund", "Vienna"])


But automation is essential for maintaining up-to-date data. While some information (e.g., city population) rarely changes, weather and flight data need regular updates. To automate the pipeline:

  1. Scheduling: Use a job scheduler (e.g., cron) to run the Python script daily.
  2. Notifications: Send alerts via Slack, Telegram, or email if the pipeline succeeds or fails (both steps are sketched right after this list).
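
As a rough sketch of both steps, with the schedule, script path, and webhook URL as placeholders rather than the project's actual setup:

# Example crontab entry: run the pipeline every day at 06:00 and keep a log.
# 0 6 * * * /usr/bin/python3 /home/user/etl/pipeline.py >> /home/user/etl/pipeline.log 2>&1

import requests

def notify(message, webhook_url):
    """Post a short status message to a Slack incoming webhook."""
    requests.post(webhook_url, json={"text": message}, timeout=10)

WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder webhook URL

try:
    # extracting_and_loading_data is the pipeline function shown above.
    result = extracting_and_loading_data(["Berlin", "Hamburg", "Munich"])
    notify(f"ETL pipeline succeeded: {result}", WEBHOOK_URL)
except Exception as error:
    notify(f"ETL pipeline failed: {error}", WEBHOOK_URL)
    raise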


Once these steps are complete, monitoring the system and fixing issues as they arise will ensure the process remains smooth.

Conclusion

This project showcases the foundational role of a data engineer in building a reliable data pipeline, much like setting up the plumbing in a house. From extracting data through web scraping and APIs to transforming it into usable formats, and finally loading it into a relational database, each step ensures that the “utilities” (data) flow smoothly for analysts and decision-makers.


We explored how to collect weather and flight data to help an e-scooter company predict usage patterns and respond to challenges like weather changes or tourist activity. The pipeline automates these processes, ensuring the data is updated daily and accessible to everyone in the company.


This isn’t the end of the journey — it’s the beginning. The groundwork laid by data engineers allows data analysts and scientists to dive deeper, uncover insights, and build predictive models. Together, these roles drive data-driven decision-making.


If you’re curious about the technical details or want to explore the code behind this project, feel free to visit my GitHub repository here. There, you’ll find all the scripts, functions, and documentation to help you dive deeper into the project.


If I’ve managed to make these concepts clear and relatable, feel free to share your thoughts or ask questions — I’d love to know if this article has inspired you to explore the world of data engineering!