Airport data is seasonal in nature, therefore any comparative analyses should be done on a period-over-period basis (i.e. January 2010 vs. January 2009) as opposed to period-to-period (i.e. January 2010 vs. February 2010). It is also important to note that fact and attribute field relationships are not always 1-to-1. For example, Passenger Counts belonging to United Airlines will appear in multiple attribute fields and are additive, which provides flexibility for the user to derive categorical Passenger Counts as desired.
Data Source:https://catalog.data.gov/dataset/air-traffic-passenger-statistics
## Load libraires
import pandas as pd
import numpy as np
import calendar
import warnings
import pandas as pd
import matplotlib.pyplot as plt
from itertools import combinations
from collections import Counter
warnings.filterwarnings("ignore")
# let load dataset
air_traffic_passenger_stats_df = pd.read_csv(r"C:\Users\jki\Downloads\Air_Traffic_Passenger_Statistics.csv")
air_traffic_passenger_stats_df.head(5)
Activity Period | Activity Period Start Date | Operating Airline | Operating Airline IATA Code | Published Airline | Published Airline IATA Code | GEO Summary | GEO Region | Activity Type Code | Price Category Code | Terminal | Boarding Area | Passenger Count | data_as_of | data_loaded_at | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 199907 | 7/1/1999 | ATA Airlines | TZ | ATA Airlines | TZ | Domestic | US | Deplaned | Low Fare | Terminal 1 | B | 31432 | 12/20/2023 14:00 | 12/20/2023 16:47 |
1 | 199907 | 7/1/1999 | ATA Airlines | TZ | ATA Airlines | TZ | Domestic | US | Enplaned | Low Fare | Terminal 1 | B | 31353 | 12/20/2023 14:00 | 12/20/2023 16:47 |
2 | 199907 | 7/1/1999 | ATA Airlines | TZ | ATA Airlines | TZ | Domestic | US | Thru / Transit | Low Fare | Terminal 1 | B | 2518 | 12/20/2023 14:00 | 12/20/2023 16:47 |
3 | 199907 | 7/1/1999 | Aeroflot Russian International Airlines | NaN | Aeroflot Russian International Airlines | NaN | International | Europe | Deplaned | Other | Terminal 2 | D | 1324 | 12/20/2023 14:00 | 12/20/2023 16:47 |
4 | 199907 | 7/1/1999 | Aeroflot Russian International Airlines | NaN | Aeroflot Russian International Airlines | NaN | International | Europe | Enplaned | Other | Terminal 2 | D | 1198 | 12/20/2023 14:00 | 12/20/2023 16:47 |
# lets check for missing values
missing_values = air_traffic_passenger_stats_df.isna().sum()
print(missing_values)
Activity Period 0 Activity Period Start Date 0 Operating Airline 0 Operating Airline IATA Code 316 Published Airline 0 Published Airline IATA Code 316 GEO Summary 0 GEO Region 0 Activity Type Code 0 Price Category Code 0 Terminal 0 Boarding Area 0 Passenger Count 0 data_as_of 0 data_loaded_at 0 dtype: int64
# let remove the missing values
air_traffic_passenger_stats_df.dropna(subset=['Operating Airline IATA Code','Published Airline IATA Code'],inplace = True)
# lets check for missing values
missing_values = air_traffic_passenger_stats_df.isna().sum()
print(missing_values)
Activity Period 0 Activity Period Start Date 0 Operating Airline 0 Operating Airline IATA Code 0 Published Airline 0 Published Airline IATA Code 0 GEO Summary 0 GEO Region 0 Activity Type Code 0 Price Category Code 0 Terminal 0 Boarding Area 0 Passenger Count 0 data_as_of 0 data_loaded_at 0 dtype: int64
# let check if have unwanted negtaive values
air_traffic_passenger_stats_df.describe()
Activity Period | Passenger Count | |
---|---|---|
count | 34714.000000 | 34714.000000 |
mean | 201172.677508 | 28240.497695 |
std | 702.139961 | 63011.306219 |
min | 199907.000000 | 0.000000 |
25% | 200603.000000 | 4567.250000 |
50% | 201206.000000 | 8721.000000 |
75% | 201804.000000 | 20077.000000 |
max | 202310.000000 | 856501.000000 |
# lets have a look at the data types
air_traffic_passenger_stats_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 34714 entries, 0 to 35029 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Activity Period 34714 non-null int64 1 Activity Period Start Date 34714 non-null object 2 Operating Airline 34714 non-null object 3 Operating Airline IATA Code 34714 non-null object 4 Published Airline 34714 non-null object 5 Published Airline IATA Code 34714 non-null object 6 GEO Summary 34714 non-null object 7 GEO Region 34714 non-null object 8 Activity Type Code 34714 non-null object 9 Price Category Code 34714 non-null object 10 Terminal 34714 non-null object 11 Boarding Area 34714 non-null object 12 Passenger Count 34714 non-null int64 13 data_as_of 34714 non-null object 14 data_loaded_at 34714 non-null object dtypes: int64(2), object(13) memory usage: 4.2+ MB
# let change the date data type
air_traffic_passenger_stats_df['Activity Period Start Date'] = pd.to_datetime(air_traffic_passenger_stats_df['Activity Period Start Date'])
air_traffic_passenger_stats_df['Activity Period Start Date'].info()
<class 'pandas.core.series.Series'> Int64Index: 34714 entries, 0 to 35029 Series name: Activity Period Start Date Non-Null Count Dtype -------------- ----- 34714 non-null datetime64[ns] dtypes: datetime64[ns](1) memory usage: 542.4 KB
# Get the count of each Airline
airline_counts = air_traffic_passenger_stats_df['Operating Airline'].value_counts()
# Display the counts
print(airline_counts)
# Sort the values in descending order and select the top ten
top_ten_airline_frequency = airline_counts.sort_values(ascending=False).head(10)
# Display the result
print(top_ten_airline_frequency)
# Plot the top ten Airlines
top_ten_airline_frequency.plot(kind='bar', color='purple', figsize=(10, 6))
plt.title('Top Ten Airlines with highest Frequency')
plt.xlabel('Operating Airline')
plt.ylabel('Frequency')
plt.show()
United Airlines 6454 SkyWest Airlines 2038 Alaska Airlines 1794 Delta Air Lines 775 American Airlines 709 ... Air Atlanta Icelandic 2 Tradewinds Airlines 2 Reno Air 2 Harmony Airways 1 Casino Express 1 Name: Operating Airline, Length: 117, dtype: int64 United Airlines 6454 SkyWest Airlines 2038 Alaska Airlines 1794 Delta Air Lines 775 American Airlines 709 Air Canada 688 United Airlines - Pre 07/01/2013 614 Southwest Airlines 599 US Airways 592 Frontier Airlines 591 Name: Operating Airline, dtype: int64
#Get the count of Domestic and International Flights
GEO_Summary_counts = air_traffic_passenger_stats_df['GEO Summary'].value_counts()
# Display the counts
print(GEO_Summary_counts)
# Sort the values in descending order and select the top ten
top_ten_GEO_Summary_frequency = GEO_Summary_counts.sort_values(ascending=False).head(10)
# Display the result
print(top_ten_GEO_Summary_frequency)
# Plot the Domestic and International Flights
top_ten_GEO_Summary_frequency.plot(kind='bar', color='red', figsize=(10, 6))
plt.title('Frequency of Domestic and International Flights ')
plt.xlabel('GEO Summary')
plt.ylabel('Frequency')
plt.show()
International 22264 Domestic 12450 Name: GEO Summary, dtype: int64 International 22264 Domestic 12450 Name: GEO Summary, dtype: int64
# Get the count of Destination
GEO_Area_counts = air_traffic_passenger_stats_df['GEO Region'].value_counts()
# Display the counts
print(GEO_Area_counts)
# Sort the values in descending order and select the top ten
top_ten_GEO_Area_destination = GEO_Area_counts.sort_values(ascending=False).head(10)
# Display the result
print(top_ten_GEO_Area_destination)
# Plot the Destinations
top_ten_GEO_Area_destination.plot(kind='bar', color='red', figsize=(10, 6))
plt.title('Frequency of Destinations ')
plt.xlabel('Destinations')
plt.ylabel('Frequency')
plt.show()
US 12450 Asia 7501 Europe 5628 Canada 3234 Mexico 2626 Australia / Oceania 1724 Central America 811 Middle East 649 South America 91 Name: GEO Region, dtype: int64 US 12450 Asia 7501 Europe 5628 Canada 3234 Mexico 2626 Australia / Oceania 1724 Central America 811 Middle East 649 South America 91 Name: GEO Region, dtype: int64
# Get the count of Terminal Frequency
Terminal_counts = air_traffic_passenger_stats_df['Terminal'].value_counts()
# Display the counts
print(Terminal_counts)
# Sort the values in descending order and select the top five
top_five_Terminal_counts = Terminal_counts.sort_values(ascending=False).head(5)
# Display the result
print(top_five_Terminal_counts)
# Plot the Destinations
top_five_Terminal_counts.plot(kind='bar', color='red', figsize=(10, 6))
plt.title('Activity of Teminals')
plt.xlabel('Teminals')
plt.ylabel('Frequency')
plt.show()
International 21384 Terminal 1 6463 Terminal 3 4616 Terminal 2 2225 Other 26 Name: Terminal, dtype: int64 International 21384 Terminal 1 6463 Terminal 3 4616 Terminal 2 2225 Other 26 Name: Terminal, dtype: int64
average_passenger_number = air_traffic_passenger_stats_df['Passenger Count'].mean()
rounded_average_passenger_number = round(average_passenger_number)
print(rounded_average_passenger_number)
28240
total_passenger_numbers = air_traffic_passenger_stats_df['Passenger Count'].max()
rounded_total_passenger_numbers = round(total_passenger_numbers)
print(rounded_total_passenger_numbers)
856501
air_traffic_passenger_stats_df.head(5)
Activity Period | Activity Period Start Date | Operating Airline | Operating Airline IATA Code | Published Airline | Published Airline IATA Code | GEO Summary | GEO Region | Activity Type Code | Price Category Code | Terminal | Boarding Area | Passenger Count | data_as_of | data_loaded_at | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 199907 | 1999-07-01 | ATA Airlines | TZ | ATA Airlines | TZ | Domestic | US | Deplaned | Low Fare | Terminal 1 | B | 31432 | 12/20/2023 14:00 | 12/20/2023 16:47 |
1 | 199907 | 1999-07-01 | ATA Airlines | TZ | ATA Airlines | TZ | Domestic | US | Enplaned | Low Fare | Terminal 1 | B | 31353 | 12/20/2023 14:00 | 12/20/2023 16:47 |
2 | 199907 | 1999-07-01 | ATA Airlines | TZ | ATA Airlines | TZ | Domestic | US | Thru / Transit | Low Fare | Terminal 1 | B | 2518 | 12/20/2023 14:00 | 12/20/2023 16:47 |
5 | 199907 | 1999-07-01 | Air Canada | AC | Air Canada | AC | International | Canada | Deplaned | Other | Terminal 1 | B | 24124 | 12/20/2023 14:00 | 12/20/2023 16:47 |
6 | 199907 | 1999-07-01 | Air Canada | AC | Air Canada | AC | International | Canada | Enplaned | Other | Terminal 1 | B | 23613 | 12/20/2023 14:00 | 12/20/2023 16:47 |
# create new column for dates
air_traffic_passenger_stats_df['Year Activity'] = air_traffic_passenger_stats_df['Activity Period Start Date'].dt.year
air_traffic_passenger_stats_df['Month Activity'] = air_traffic_passenger_stats_df['Activity Period Start Date'].dt.month
air_traffic_passenger_stats_df.head(5)
Activity Period | Activity Period Start Date | Operating Airline | Operating Airline IATA Code | Published Airline | Published Airline IATA Code | GEO Summary | GEO Region | Activity Type Code | Price Category Code | Terminal | Boarding Area | Passenger Count | data_as_of | data_loaded_at | Year Activity | Month Activity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 199907 | 1999-07-01 | ATA Airlines | TZ | ATA Airlines | TZ | Domestic | US | Deplaned | Low Fare | Terminal 1 | B | 31432 | 12/20/2023 14:00 | 12/20/2023 16:47 | 1999 | 7 |
1 | 199907 | 1999-07-01 | ATA Airlines | TZ | ATA Airlines | TZ | Domestic | US | Enplaned | Low Fare | Terminal 1 | B | 31353 | 12/20/2023 14:00 | 12/20/2023 16:47 | 1999 | 7 |
2 | 199907 | 1999-07-01 | ATA Airlines | TZ | ATA Airlines | TZ | Domestic | US | Thru / Transit | Low Fare | Terminal 1 | B | 2518 | 12/20/2023 14:00 | 12/20/2023 16:47 | 1999 | 7 |
5 | 199907 | 1999-07-01 | Air Canada | AC | Air Canada | AC | International | Canada | Deplaned | Other | Terminal 1 | B | 24124 | 12/20/2023 14:00 | 12/20/2023 16:47 | 1999 | 7 |
6 | 199907 | 1999-07-01 | Air Canada | AC | Air Canada | AC | International | Canada | Enplaned | Other | Terminal 1 | B | 23613 | 12/20/2023 14:00 | 12/20/2023 16:47 | 1999 | 7 |
import calendar
# Replace NaN or inf values in the 'Month' column with a default value (e.g., 0)
air_traffic_passenger_stats_df['Month Activity'] = air_traffic_passenger_stats_df['Month Activity'].fillna(0)
# Convert month numbers to abbreviated month names only if they are numeric
air_traffic_passenger_stats_df['Month Activity'] = air_traffic_passenger_stats_df['Month Activity'].apply(
lambda x: calendar.month_abbr[int(x)] if x.isdigit() else x
)
# Group by month and calculate total passenger numbers for each month
passenger_numbers_by_month = air_traffic_passenger_stats_df.groupby('Month Activity').sum()['Passenger Count']
# Find the best month for passenger numbers
best_month = passenger_numbers_by_month.idxmax()
passenger_numbers_for_best_month = passenger_numbers_by_month.max()
print(f"The best month for passenger numbers was {best_month} with passenger numbers of {passenger_numbers_for_best_month:,.2f}")
The best month for passenger numbers was Aug with passenger numbers of 98,041,104.00
# Replace NaN or inf values in the 'Year' column with a default value (e.g., 0)
air_traffic_passenger_stats_df['Year Activity'] = air_traffic_passenger_stats_df['Year Activity'].fillna(0)
# Convert years to integers
air_traffic_passenger_stats_df['Year Activity'] = air_traffic_passenger_stats_df['Year Activity'].astype(int)
# Group by Year and calculate passenger numbers
passenger_numbers_by_year = air_traffic_passenger_stats_df.groupby('Year Activity').sum()['Passenger Count']
# Find the best Year for passenger numbers
best_year = passenger_numbers_by_year.idxmax()
passenger_numbers_for_best_year = passenger_numbers_by_year.max()
print(f"The best Year for passenger numbers was {best_year} with passenger numbers of {passenger_numbers_for_best_year:,.2f}")
The best Year for passenger numbers was 2018 with passenger numbers of 57,746,775.00
# Group by 'Airlines' and calculate the sum of 'Passenger Counts'
passenger_count = air_traffic_passenger_stats_df.groupby('Operating Airline')['Passenger Count'].sum()
# Sort the values in descending order and select the top ten
top_airlines = passenger_count.sort_values(ascending=False).head(10)
# Display the result
print(top_airlines)
# Plot the top ten Airlines
top_airlines.plot(kind='bar', color='skyblue', figsize=(10, 6))
plt.title('Top Ten Airlies with the most Passenger numbers')
plt.xlabel('Airline')
plt.ylabel('Passenger Count')
plt.show()
Operating Airline United Airlines 371311286 American Airlines 75289134 SkyWest Airlines 72955183 Delta Air Lines 62283500 Southwest Airlines 45887207 Alaska Airlines 43875550 Virgin America 37843920 United Airlines - Pre 07/01/2013 32484390 US Airways 27771567 JetBlue Airways 17732288 Name: Passenger Count, dtype: int64
# Group by 'Destinations' and calculate the sum of 'Passenger Counts'
passenger_count = air_traffic_passenger_stats_df.groupby('GEO Region')['Passenger Count'].sum()
# Sort the values in descending order and select the top ten
top_destinations = passenger_count.sort_values(ascending=False).head(10)
# Display the result
print(top_destinations)
# Plot the top ten Airlines
top_destinations.plot(kind='bar', color='green', figsize=(10, 6))
plt.title('Top Ten Destination with the most Passenger numbers')
plt.xlabel('Destinations')
plt.ylabel('Passenger Count')
plt.show()
GEO Region US 753399618 Asia 93514952 Europe 63161620 Canada 30573750 Mexico 19744400 Australia / Oceania 10459829 Middle East 5271749 Central America 3963897 South America 250822 Name: Passenger Count, dtype: int64