Analyzing U.S. Air Flights with Apache Spark and Hadoop

For my university course in Big Data and Cloud Computing, I developed a hands-on project to process and analyze U.S. domestic flight data using a big data pipeline built on Apache Spark, Hadoop and Hive.

📂 Project Repository on GitHub

✈️ Project Overview

The dataset comes from the U.S. Department of Transportation (DOT) and includes over 20 years of flight records across domestic U.S. routes. It contains rich details such as:

Flight numbers, carriers, and tail numbers
Origin and destination airports
Scheduled and actual times
Delays and cancellations

With millions of rows and numerous fields, the dataset is an ideal candidate for distributed processing.

⚙️ Technologies Used

Apache Spark (Scala API) — For distributed processing and querying
HDFS — As the primary distributed file system for data storage
Hive - For datawarehousing flights
Docker Compose — To setuo a local dev environment
K8s - For the future deploy of the application

The infrastructure is designed to simulate a real-world Big data ecosystem on a developer machine.

🧠 Analysis Goals

The goal was to answer questions such as:

Which U.S. airports are the busiest?
What carriers experience the most delays?
Which routes have the highest cancellation rates?
Are there noticeable patterns in delays based on time of year?

We used Spark DataFrames and SQL for efficient querying and Python for ETL logic.

📊 Sample Insights

Here are just a few of the findings:

Hartsfield–Jackson Atlanta International Airport (ATL) consistently ranks as the busiest hub.
Certain routes exhibit seasonal delay patterns, especially in the Northeast during winter.
Flights operated by low-cost carriers showed higher cancellation rates, likely due to tighter fleet logistics.

🎓 Educational Value

This project helped me solidify core big data concepts:

Working with large-scale tabular data
Tuning Spark jobs and understanding data shuffling and partitioning
Managing multi-node workflows via YARN and HDFS
Applying functional programming in Scala for clean and scalable ETL pipelines

✈️ Project Overview#

⚙️ Technologies Used#

🧠 Analysis Goals#

📊 Sample Insights#

🎓 Educational Value#

✈️ Project Overview

⚙️ Technologies Used

🧠 Analysis Goals

📊 Sample Insights

🎓 Educational Value