For my university course in Big Data and Cloud Computing, I developed a hands-on project to process and analyze U.S. domestic flight data using a big data pipeline built on Apache Spark, Hadoop and Hive.

📂 Project Repository on GitHub

✈️ Project Overview

The dataset comes from the U.S. Department of Transportation (DOT) and includes over 20 years of flight records across domestic U.S. routes. It contains rich details such as:

  • Flight numbers, carriers, and tail numbers
  • Origin and destination airports
  • Scheduled and actual times
  • Delays and cancellations

With millions of rows and numerous fields, the dataset is an ideal candidate for distributed processing.

⚙️ Technologies Used

  • Apache Spark (Scala API) — For distributed processing and querying
  • HDFS — As the primary distributed file system for data storage
  • Hive - For datawarehousing flights
  • Docker Compose — To setuo a local dev environment
  • K8s - For the future deploy of the application

The infrastructure is designed to simulate a real-world Big data ecosystem on a developer machine.

🧠 Analysis Goals

The goal was to answer questions such as:

  • Which U.S. airports are the busiest?
  • What carriers experience the most delays?
  • Which routes have the highest cancellation rates?
  • Are there noticeable patterns in delays based on time of year?

We used Spark DataFrames and SQL for efficient querying and Python for ETL logic.

📊 Sample Insights

Here are just a few of the findings:

  • Hartsfield–Jackson Atlanta International Airport (ATL) consistently ranks as the busiest hub.
  • Certain routes exhibit seasonal delay patterns, especially in the Northeast during winter.
  • Flights operated by low-cost carriers showed higher cancellation rates, likely due to tighter fleet logistics.

🎓 Educational Value

This project helped me solidify core big data concepts:

  • Working with large-scale tabular data

  • Tuning Spark jobs and understanding data shuffling and partitioning

  • Managing multi-node workflows via YARN and HDFS

  • Applying functional programming in Scala for clean and scalable ETL pipelines