Have you experienced severe train delays, especially on the New Jersey Transit system? What are the primary causes of this delay? Where does it happen more? Does climate impact delays? These are some questions we hope to answer while building this analysis and website. But most important of all, can delay prediction be used to maintain and upgrade tracks for optimum service? New Jersey Transit trains have been experiencing delays from as long as one can remember. NJ Transit is the primary transit authority in the state of New Jersey and has control over its rail systems and tracks. This project looks into the causes of delays and assumes that most of them might be attributed to climatic conditions. As climate change progresses rapidly, these delays have worsened and the project looks into how addressing and predicting for these delays can help select those tracks that need more maintenance and checks.
We are creating a Government-facing website for NJ transit authority to predict where most train delays will happen to schedule checking, remodeling and upgradation of train tracks in those segments. We believe that prevention is better than cure. As of 2010, NJ Transit spends 67% of its budget on repairs. We hope our forecast models will allow NJ Transit to optimize repair funding more efficiently and prevent delays. This analysis uses a simple Machine Learning model based on a linear regression, making it both simple and easily replicable. We see great uses of a similar model in other cities like New York and Philadelphia to help predict for SEPTA and MTA delays to scale. This model is also approaching the most overlooked issue in rails: maintenance. While constructing new lines or upgrading trains might be well appreciated in the public eye, any transit employee will tell you that the most painstaking undertaking is maintenance. By assuming that a large share of delay occur because of over heating or cooling of tracks, wind speed, visibility issues or heating issues causing wire tangling, we predict for where these delays can happen so that proper checks and maintenance can be put into place.
#update.packages(repos = "https://cran.rstudio.com/")
#install.packages("stringdist")
#update.packages("kableExtra")
#detach("kableExtra:dpylr", unload = TRUE)
#install.packages("htmltools")
library(htmltools)
library(tidyverse)
library(sf)
library(lubridate)
library(httr)
library(tigris)
library(tidycensus)
library(car)
library(caret)
library(corrplot)
library(dplyr)
#library(kableExtra)
library(viridis)
library(riem)
library(gridExtra)
library(purrr)
library(stringdist)
library(knitr)
library(ggplot2)
library(RSocrata)
library(sf)
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
plotTheme <- theme(
plot.title =element_text(size=12, family = "mono", face = "bold"),
plot.subtitle = element_text(size=8, family = "mono"),
plot.caption = element_text(size = 6, family = "mono"),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1, family = "mono"),
axis.text.y = element_text(size = 10, family = "mono"),
axis.title.y = element_text(size = 10, family = "mono"),
axis.title.x = element_text(size = 10, family = "mono"),
# Set the entire chart region to blank
panel.background=element_rect(fill= "#F4EAE1"),
plot.background=element_blank(),
#panel.border=element_rect(colour="#F0F0F0"),
# Format the grid
panel.grid.major=element_line(colour="#CFC2B4",size=.2),
axis.ticks=element_blank())
mapTheme <- theme(plot.title =element_text(size=12, family = "mono", face = "bold"),
plot.subtitle = element_text(size=7 , family = "mono"),
plot.caption = element_text(size = 5 , family = "mono"),
axis.line=element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank(),
panel.background=element_rect(fill= "#F4EAE1"),
panel.border=element_blank(),
panel.grid.major=element_line(colour = 'transparent'),
panel.grid.minor=element_blank(),
legend.direction = "vertical",
legend.position = "right",
legend.text = element_text(size=8, family = "mono"),
legend.title = element_text(size = 9, family = "mono", face = "bold"),
plot.margin = margin(1, 1, 1, 1, 'cm'),
legend.key.height = unit(1, "cm"), legend.key.width = unit(0.3, "cm"))
palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette2 <- c("#6baed6","#08519c")
# census_api_key("YOUR KEY GOES HERE", overwrite = TRUE)
We are looking at the late summer months (July and August) of 2018 because we hypothesize more train delays in those months. Our data comes form a Kaggle dataset and has information around what station each train is leaving from and going to, it’s scheduled time and actual time and its minutes of delay, its date, line and status. This is a pretty comprehensive data set for our analysis but it would be even better if it contained information about cause of delay.
dat2018_07 <- read.csv("data/2018_07.csv")
dat2018_08 <- read.csv("data/2018_08.csv")
dat <- rbind(dat2018_07, dat2018_08) %>%
filter(from != to) %>% filter(type != "Amtrak")
njtrans_s <- st_read("https://services6.arcgis.com/M0t0HPE53pFK525U/arcgis/rest/services/NJTransit_Rail_Stations/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson")
njtrans_l <- st_read("https://services6.arcgis.com/M0t0HPE53pFK525U/arcgis/rest/services/NJTRANSIT_RAIL_LINES_1/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson") %>% st_transform(crs = st_crs(njtrans_s))
njstate <- st_read("https://services2.arcgis.com/XVOqAjTOJ5P6ngMu/arcgis/rest/services/NJ_Counties_3857/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson") %>% st_transform(crs = st_crs(njtrans_s))
NJ transit is New Jersey’s Public Transportation Corporation. Covering a service area of 5,325 square miles, NJ Transit is the nation’s third largest provider of bus, rail and light rail transit, linking major points in New Jersey, New York and Philadelphia. The agency operates an active fleet of 1,231 trains and 93 light rail vehicles for a total of 12 rail lines statewide, it provides nearly 270 million passenger trips each year. Source
njtrans_l <- njtrans_l %>%
mutate(COLOR = case_when(
LINE_CODE == "AC" ~ "005DAA",
LINE_CODE == "BC" ~ "B9C9DF",
LINE_CODE == "ML" ~ "FFCF01",
LINE_CODE == "ME" ~ "00A94F",
LINE_CODE == "SL" ~ "B7A673",
LINE_CODE == "NC" ~ "EF3E42",
LINE_CODE == "MC" ~ "E66B5B",
LINE_CODE == "PV" ~ "8e258d",
LINE_CODE == "PRIN" ~ "EF3E42",
LINE_CODE == "RV" ~ "FAA634",
LINE_CODE == "GL" ~ "A2D5AE",
LINE_CODE == "NE" ~ "EF3E42",
LINE_CODE == "ST" ~ "f48aa4",
TRUE ~ NA_character_
))
njtrans_l$COLOR <- as.character(njtrans_l$COLOR)
njtrans_l <- njtrans_l %>%
mutate(Modified_COLOR = paste0("#", COLOR))
#plotting
ggplot() +
geom_sf(data=st_union(njstate), color= "#5D4954", fill= "#ffffff") +
geom_sf(data = njtrans_l, aes(color = Modified_COLOR), size = 6) +
scale_color_manual(values = njtrans_l$Modified_COLOR,
labels = njtrans_l$LINE_NAME) +
geom_sf(data = njtrans_s, show.legend = "point", size = 0.3) +
labs(title="NJ Transit Rail",
subtitle="New Jersey") +
guides(color = guide_legend(title = "Rail Lines",
keywidth = 1, # Adjust key width as needed
keyheight = 1, ))+
mapTheme
library(dplyr)
library(stringr)
library(sf)
dat$from <- tolower(dat$from)
njtrans_s$STATION_ID <- tolower(njtrans_s$STATION_ID)
dat <- dat %>%
mutate(
from_cleaned = as.character(from) %>%
str_replace_all(" junction", "") %>%
str_replace_all("mount", "mt") %>%
str_replace_all(" line", "") %>%
str_replace_all(" terminal", "") %>%
str_replace_all("montclair state u", "montclair st univ") %>%
str_replace_all("-montclair", "") %>%
str_replace_all(" avenue", "") %>%
str_replace_all("-williams ave", "") %>%
str_replace_all("-hackensack", "") %>%
str_replace_all(" fair lawn", "") %>%
str_replace_all(" jct.", "") %>%
str_replace_all(" rail terminal", "") %>%
str_replace_all("atlantic city rail", "atlantic city") %>%
str_replace_all(" lvl", " level") %>%
str_replace_all("philadelphia", "30th street station") %>%
str_replace_all(" transit center", "") %>%
str_replace_all("williams ave", "") %>%
str_replace_all(" view", "") %>%
str_replace_all("secaucus concourse", "secaucus lower level") %>%
str_replace_all("-", " "))
njtrans_s <- njtrans_s %>%
mutate(
from_cleaned = as.character(STATION_ID) %>%
str_replace_all(" junction", "") %>%
str_replace_all("mount", "mt") %>%
str_replace_all(" line", "") %>%
str_replace_all(" terminal", "") %>%
str_replace_all("montclair state u", "montclair st univ") %>%
str_replace_all("-montclair", "") %>%
str_replace_all(" avenue", "") %>%
str_replace_all("-williams ave", "") %>%
str_replace_all("-hackensack", "") %>%
str_replace_all(" fair lawn", "") %>%
str_replace_all(" jct.", "") %>%
str_replace_all(" rail terminal", "") %>%
str_replace_all("atlantic city rail", "atlantic city") %>%
str_replace_all(" lvl", " level") %>%
str_replace_all("philadelphia", "30th street station") %>%
str_replace_all(" transit center", "") %>%
str_replace_all("-", " ") %>%
str_replace_all("williams ave", "") %>%
str_replace_all(" view", "")
)
njtrans_s <- njtrans_s %>%
mutate(
from_cleaned = case_when(
OBJECTID_1 == "102" ~ paste(from_cleaned, "main st", sep = " "),
OBJECTID_1 == "156" ~ paste("ramsey route 17"),
TRUE ~ from_cleaned
)
)
njtrans_s <- njtrans_s %>%
mutate(
from_cleaned = case_when(
OBJECTID_1 == "149" ~ paste("middletown ny"),
OBJECTID_1 == "12" ~ paste("middletown nj"),
TRUE ~ from_cleaned
)
)
merged_data <- dat %>%
left_join(select(njtrans_s, from_cleaned, LATITUDE, LONGITUDE, geometry), by = "from_cleaned")
merged_data <- merged_data %>%
mutate(
delay = as.logical(case_when(
delay_minutes > 0.5 ~ TRUE,
delay_minutes <= 0.5 ~ FALSE
))
)
merged_data <- st_sf(merged_data, crs = 4326)
merged_data <- merged_data %>%
mutate(interval60 = floor_date(ymd_hms(scheduled_time), unit = "hour"), #datetime string
interval15 = floor_date(ymd_hms(scheduled_time), unit = "15 mins"),
week = week(interval60),
dotw = wday(interval60, label=TRUE))
stationdelays <- merged_data %>%
group_by(from_cleaned) %>%
summarize(
total_observations = n(),
total_delays = sum(delay),
delay_pct = total_delays / total_observations,
total_delayed_minutes = sum(delay_minutes, na.rm = TRUE),
mean_delayed_minutes = mean(delay_minutes, na.rm = TRUE),
LATITUDE = first(LATITUDE),
LONGITUDE = first(LONGITUDE),
week = first(week),
dotw = first(dotw))
From the maps it is clear that the maximum delay minutes are experienced at the hub of NJ: Newark. Delay minutes are clustered near the part of the state with most stations with fewer delays happening on the outskirts. Some stations closer to New York such as Newark station and Secaucus junction have total delayed minutes close to 5000 minutes in 2 months leading huge revenue loss. Interestingly Princeton and Princeton Junction also have close to 4000 minutes of delays in 60 days which could mean the loss of precious time for students, faculty and staff attending Princeton. This shows the dire need for our tool to understand pain points for NJ Transit and address them locally.
delayed_trips <- merged_data %>%
filter(delay)
plot1 <- ggplot() +
geom_sf(data = st_union(njstate), fill = "#F4EAE1") +
stat_density_2d(
data = as.data.frame(st_coordinates(delayed_trips)),
aes(x = X, y = Y, fill = ..level..),
geom = 'polygon',
bins = 50, alpha = 0.3
) +
scale_fill_gradient(low = "#BBC2DD", high = "#032863", name = "Density") +
labs(title = "Density of Delayed Trips", subtitle = "NJ Transit")+
mapTheme
plot2 <- ggplot() +
geom_sf(data = st_union(njstate), color = "#5D4954", fill = "#BBC2DD") +
geom_sf(data = njtrans_l, color = "white", size = 1) +
geom_sf(data = stationdelays, aes(color = total_delayed_minutes), size = 1) +
scale_color_viridis_c(name = "Delayed Mins", option = "plasma") +
scale_size_continuous(name = "Size (points)", range = c(3, 10), labels = scales::number_format(unit = "pt")) + # Adjust size range as needed
labs(title = "Total Delayed Minutes per Station",
subtitle = "NJ Transit") +
guides(color = guide_legend(
title = "Minutes",
keywidth = 1, # Adjust key width as needed
keyheight = 1, # Adjust key height as needed
label.position = "bottom" # Adjust label position
)) +
mapTheme
grid.arrange(plot1, plot2, ncol =2)
Just within our 2-month time frame, delays occur across NJ by 10’s of thousands of minutes cumulatively, namely along the Northeast Corridor. According to our top ten histogram, Newark Penn Station in particular experiences over 2,000 hours of delays cumulatively over 2 months in 2018. There is an interesting difference in upper level trains and lower level trains at Secaucus, which might be because of higher maintenance and support needed for tracks on upper level. In terms of mean delayed minutes, Monmouth Park seems to have on an average 8 minutes of delays for late trains. Other stations are between 6-8 minutes each.
top_stations_total <- stationdelays %>%
top_n(10, total_delayed_minutes)
top_stations_mean <- stationdelays %>%
top_n(10, mean_delayed_minutes)
grid.arrange(
ggplot(top_stations_total, aes(x = from_cleaned, y = total_delayed_minutes)) +
geom_bar(stat = "identity", fill = "#032863", color = "#BBC2DD") +
labs(title = "Top 10 Stations with Highest Total Delayed Minutes",
x = "Station",
y = "Total Delayed Minutes") + plotTheme,
ggplot(top_stations_mean, aes(x = from_cleaned, y = mean_delayed_minutes)) +
geom_bar(stat = "identity", fill = "#032863", color = "#BBC2DD") +
labs(title = "Top 10 Stations with Highest Mean Delayed Minutes",
x = "Station",
y = "Total Delayed Minutes") + plotTheme,
ncol = 1)
The image shows that NJ Transit has to clear out leaves every Fall using AquaTrack:
Here is an example of train tracks facing extreme heat and bending:
We believe that this is the most important factor that influences train delays. The first four plots below are pulled from the University of Iowa weather repository based off different weather monitoring stations at Airports across the world. We derive the EWR (Newark airport) data since it acts as the main airport for the state. As visible, there was rainfall on some days of the week but most days did not see any precipitation. The wind speeds went upto a high of 30 miles/ hour but most days saw a mix of 0,10 and 20 miles/ hour. Temperature in the beginning of July was in the range of eighties but saw peaks going upto 100 degrees some days.Visibility was on an average at 10 but dropped on some days. On days with high precipitation, visibility is low.
The fifth plot looks at the overall time pattern which shows a daily periodicity and lower usage periods on weekends. This plot below shows that there were around 1500 trips taken daily. Again, lesser trips are taken on weekends, maybe lesser than 500, (seen in the periodic dip) and fairly the same amount of trips are taken on weekdays.
weather.Panel <-
riem_measures(station = "EWR", date_start = "2018-07-01", date_end = "2018-08-31") %>%
dplyr::select(valid, tmpf, p01i, sknt, vsby)%>%
replace(is.na(.), 0)
weather.Panel <- weather.Panel %>%
mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
group_by(interval60) %>%
summarize(Temperature = max(tmpf),
Precipitation = sum(p01i),
Wind_Speed = max(sknt),
Visibility = max(vsby)
)%>%
mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
glimpse(weather.Panel)
grid.arrange(
ggplot(weather.Panel, aes(interval60,Precipitation)) +
geom_area(col="#24775f", fill = "#BBC2DD") +
geom_line() +
labs(title="Precipitation", x="Hour", y="Precipitation") + plotTheme,
ggplot(weather.Panel, aes(interval60,Wind_Speed)) +
geom_area(col="#24775f", fill = "#032863") +
geom_line() +
labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
ggplot(weather.Panel, aes(interval60,Temperature)) +
geom_area(col="#24775f", fill = "#BF421B") +
geom_line() +
labs(title="Temperature", x="Hour", y="Temperature") +
plotTheme,
ggplot(weather.Panel, aes(interval60,Visibility)) +
geom_area(col="#24775f", fill = "#BFAE9B") +
geom_line() +
labs(title="Visibility", x="Hour", y="Visibility") +
plotTheme,
ggplot(merged_data %>%
group_by(interval60) %>%
tally())+
geom_line(aes(x = interval60, y = n))+
labs(title="Trips",
x="Date",
y="Number of trips")+
plotTheme,
ncol= 1,
top="Weather Data - July to August, 2018")
The above plot shows a histogram of how many trips were delayed. Some trips were delayed by even 2 hours. There were a few trips that we delayed by half an hour, an hour and even by 2 hours. Most trains were delayed between 0 to 5 minutes. This means that shorter delays are more common than longer delays which might have implications on how the website could be used.
The plots below shows the distribution of train trips across 24 hours. It seems that weekdays see a huge increase in morning trains between 7am-9am and evening rush between 4pm-6pm. On an average, weekdays have way more trains running. Weekends have a high number of trains close to midnight which may be a result of the nightlife in New Jersey and New York and die down between 1-7 am. Between 8am and 9pm, there is an average of 2000 trips. This could be beneficial information in even scheduling maintenance and checks at times that are less popular. It also highlights on what times most critical serviceability is needed: lesser delay minutes here could mean more revenue for NJ Transit which could mean more upgraded infrastructure.
grid.arrange(
ggplot(merged_data %>%
mutate(hour = hour(actual_time))) +
geom_freqpoly(aes(hour, color = dotw), binwidth = 1) +
labs(title = "Train schedule NJT, by day of the week, July-August, 2018",
x = "Hour",
y = "Trip Counts") +
plotTheme,
ggplot(merged_data %>%
mutate(hour = hour(actual_time),
weekend = ifelse(dotw == "Sun" | dotw == "Sat", "Weekend", "Weekday"))) +
geom_freqpoly(aes(hour, color = weekend), binwidth = 1) +
labs(title = "Train schedule NJT - weekend vs weekday, July-August, 2018",
x = "Hour",
y = "Trip Counts") +
plotTheme
)
For our use case, we decided to employ both spatial and time-series variables through appending a weather panel to our train trip panel. The resulting model data is a time-series data that looks at delay data and weather data in intervals of one hour, as well the existing spatial attributes such as station, stop, etc.
The following panel joins our train data and weather panel.
Next, we will create new columns with time lags by an hour, 2 hours, 3 hours 4 hours and 12 hours in addition to a day. We will also create a new column called holidays that will evaluate if the day is a holiday from previous data and classify it as one.
ride.panel <-
ride.panel %>%
arrange(from_id, interval60) %>%
mutate(lagHour = dplyr::lag(delay_minutes,1),
lag2Hours = dplyr::lag(delay_minutes,2),
lag3Hours = dplyr::lag(delay_minutes,3),
lag4Hours = dplyr::lag(delay_minutes,4),
lag12Hours = dplyr::lag(delay_minutes,12),
lag1day = dplyr::lag(delay_minutes,24),
holiday = ifelse(yday(interval60) == 148,1,0)) %>%
mutate(day = yday(interval60)) %>%
mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
dplyr::lag(holiday, 2) == 1 ~ "PlustTwoDays",
dplyr::lag(holiday, 3) == 1 ~ "PlustThreeDays",
dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
holidayLag = ifelse(is.na(holidayLag) == TRUE, 0, holidayLag))
library(gridExtra)
library(ggplot2)
grid.arrange(
ggplot(ride.panel) +
geom_point(aes(x = Precipitation, y = delay_minutes), color = "#032863", alpha = 0.4) +
geom_smooth(aes(x = Precipitation, y = delay_minutes), method = "lm", se = FALSE, color = "#BBC2DD") +
labs(title = "Delay Minutes vs Precipitation",
x = "Precipitation",
y = "Delay Minutes") + plotTheme,
ggplot(ride.panel) +
geom_point(aes(x = Temperature, y = delay_minutes), color = "#032863", alpha = 0.4) +
geom_smooth(aes(x = Temperature, y = delay_minutes), method = "lm", se = FALSE, color = "#BBC2DD") +
labs(title = "Delay Minutes vs Temperature",
x = "Temperature",
y = "Delay Minutes") + plotTheme,
ggplot(ride.panel) +
geom_point(aes(x = Wind_Speed, y = delay_minutes), color = "#032863", alpha = 0.4) +
geom_smooth(aes(x = Wind_Speed, y = delay_minutes), method = "lm", se = FALSE, color = "#BBC2DD") +
labs(title = "Delay Minutes vs Wind Speed",
x = "Wind_Speed",
y = "Delay Minutes") + plotTheme,
ggplot(ride.panel) +
geom_point(aes(x = Visibility, y = delay_minutes), color = "#032863", alpha = 0.4) +
geom_smooth(aes(x = Visibility, y = delay_minutes), method = "lm", se = FALSE, color = "#BBC2DD") +
labs(title = "Delay Minutes vs Visibility",
x = "Visibility",
y = "Delay Minutes") + plotTheme,
ncol = 2,
top = "Weather and Delay Minutes Correlation - July to August, 2018"
)
Now we will check to see if delays are affected by extreme weather conditions. From the graphs, it seems like some of the longest delays have occurred when it was not raining. As the precipitation amount increases, there are long delays. Towards the end of the graph, as amount of precipitation increases, there are fewer and shorter delay. This points to mild correlation.
In terms of Temperature, it seems like these months had an average of 70 degrees and high delays were experienced some days. Especially, very long delays were observed when the temperature of that day was at a higher end (80 or above). It seems that wind speed and visibility might not be very correlated to delays because both of them see steady delays across all wind speeds and visibility. We will try to run some models now, mixing some of these weather variables.
To run different models, we first split our data into a training and
a test set. We create three linear models using the lm
function (linear models). The first three weeks of July (week 26-28) are
used as training data and the last week of July and first week of August
is used as the testing data. We create the models using our training
data ride.Train
. The first models include only temporal
controls, but the later ones contain all of our lag information. We are
using a partition that is user specified and is time dependent. We chose
to split our data by time.
ride.Train <- dplyr::filter(ride.panel, week %in% c(28, 29, 30))
ride.Test <- dplyr::filter(ride.panel, week < 28)
All our regression models contain independent variable as
delay_minutes
which is the number delayed minutes. Our
first regression has dependent variables as interval60
which is each unique date and hour of the months of July and August,
dotw
or the day of the week and temperature
.
The second regression has dependent variables as
from_cleaned
which is the departure station id, day of the
week and all our climatic variables like Temperature, Precipitation,
Visibility and Wind speed. The third regression is similar to the second
one with additional data related to lag (1,2, 3, 12 hours and one day)
and lagholiday
and holiday
data.
reg1 <-
lm(delay_minutes ~ interval60 + dotw + Temperature, data=ride.Train)
reg2 <-
lm(delay_minutes ~ from_cleaned + interval60 + dotw + Temperature + Precipitation + Visibility + Wind_Speed,
data=ride.Train)
reg3 <-
lm(delay_minutes ~ from_cleaned + interval60 + dotw + Temperature + Precipitation + Visibility + Wind_Speed +
lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday,
data=ride.Train)
To test how well the models do, we will use a k-fold
cross validation where we assign cvIDs (cross validation IDs) randomly
to all unique rows. We will run this cross validation over all five
weeks (that we had previously split into train and test) to see how
generalizable and accurate our models
are. We will show the top 5 samples from running all k-fold validations
and see how well the models perform across all five models. We will
split data based on time, meaning each fold represents data from a
different time period. This is because we have seen that our data
exhibits temporal patterns, and it helps assess how well the model
generalizes to unseen time periods.
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)
ride.Train <- st_set_geometry(ride.Train, NULL)
reg.cv1 <-
train(delay_minutes ~ ., data = ride.Train %>%
dplyr::select(delay_minutes,
interval60, dotw , Temperature),
method = "lm", trControl = fitControl, na.action = na.pass)
reg.cv2 <-
train(delay_minutes ~ ., data = ride.Train %>%
dplyr::select(delay_minutes,
from_cleaned ,interval60, dotw, Temperature, Precipitation, Visibility, Wind_Speed),
method = "lm", trControl = fitControl, na.action = na.pass)
reg.cv3 <-
train(delay_minutes ~ ., data = ride.Train %>%
dplyr::select(delay_minutes,
from_cleaned , interval60, dotw, Temperature, Precipitation,
lagHour, lag2Hours, lag3Hours, lag12Hours, lag1day, holiday, Visibility, Wind_Speed),
method = "lm", trControl = fitControl, na.action = na.pass)
extract_results <- function(reg_cv, model_name) {
results <- as.data.frame(reg_cv$resample)
results$model <- model_name
# Filter for the first 5 folds
results <- results[1:5, ]
return(results)
}
all_results <- bind_rows(
extract_results(reg.cv1, "reg.cv1"),
extract_results(reg.cv2, "reg.cv2"),
extract_results(reg.cv3, "reg.cv3")
)
library(kableExtra)
kable(all_results, caption = "Regression Model Cross-Validation Results", align = "c") %>%
kable_styling("striped", full_width = FALSE)
RMSE | Rsquared | MAE | Resample | model |
---|---|---|---|---|
5.325470 | 0.0022141 | 3.168298 | Fold001 | reg.cv1 |
5.465386 | 0.0051035 | 3.241763 | Fold002 | reg.cv1 |
5.476234 | 0.0134466 | 3.227853 | Fold003 | reg.cv1 |
5.350452 | 0.0069206 | 3.174769 | Fold004 | reg.cv1 |
5.029503 | 0.0092997 | 3.127089 | Fold005 | reg.cv1 |
5.549391 | 0.0234141 | 3.222613 | Fold001 | reg.cv2 |
4.659120 | 0.0447820 | 2.976860 | Fold002 | reg.cv2 |
4.767482 | 0.0845974 | 2.971308 | Fold003 | reg.cv2 |
4.573005 | 0.0676700 | 2.938677 | Fold004 | reg.cv2 |
5.711818 | 0.0464883 | 3.294688 | Fold005 | reg.cv2 |
5.907981 | 0.0504834 | 3.268331 | Fold001 | reg.cv3 |
5.860619 | 0.0349745 | 3.111318 | Fold002 | reg.cv3 |
5.633685 | 0.0613958 | 3.215898 | Fold003 | reg.cv3 |
5.165269 | 0.0599809 | 3.126408 | Fold004 | reg.cv3 |
4.737500 | 0.0768399 | 2.927418 | Fold005 | reg.cv3 |
Now we will look at the first five folds for each of our regressions across Root Mean Squared Error(RMSE), R-squared, and Mean absolute error(MAE).
RMSE (Root Mean Squared Error): This metric measures the average magnitude of the errors between predicted and actual values. For the best model, we are looking at lower RMSE values and for the model to generalize well, we are looking for consistency across folds. Regression 1 has high RMSE values which indicates poor model performance but they are fairly consistent. Similarly, reg2 and reg 3 have similar RMSE values but they fluctuate quite a bit.
Rsquared: This metric measures the proportion of the variance in the dependent variable that is predictable from the independent variables. A higher R-squared value suggests that a larger proportion of the variance in the dependent variable is explained by the model. Higher R-squared values are desirable. We are looking at higher r-squared values and more consistency. regression 1,2 and 3 have very low r-squared values that are lower than 10% dependability.
MAE (Mean Absolute Error): This metric represents the average absolute difference between predicted and actual values. Like RMSE, lower MAE values indicate better model performance. The MAEs decrease as we go down the table. This means that reg1 has the highest MAE which is still not too much while reg3 has the least mean absolute error.
Now we create a nested dataframe and a function called
model_pred
that will be used to test data.
ride.Test.weekNest
is a nested data
frame which has data grouped into subsets, and each subset (or “nest”)
contains a data frame.week_predictions
, is a data frame
where each row represents a combination of observed and predicted values
along with corresponding metrics.
week_predictions <- ride.Test.weekNest %>%
mutate(
reg1_FE = map(.x = data, fit = reg1, .f = model_pred),
reg2_FE = map(.x = data, fit = reg2, .f = model_pred),
reg3_FE = map(.x = data, fit = reg3, .f = model_pred)
) %>%
gather(Regression, Prediction, -data, -week) %>%
mutate(
Observed = map(data, pull, delay_minutes),
Absolute_Error = map2(Observed, Prediction, ~ abs(.x - .y)),
MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE)
)
week_predictions
## # A tibble: 6 × 8
## week data Regression Prediction Observed Absolute_Error MAE sd_AE
## <dbl> <list> <chr> <list> <list> <list> <dbl> <dbl>
## 1 26 <sf> reg1_FE <dbl [4,575]> <dbl> <dbl [4,575]> 3.18 4.12
## 2 27 <sf> reg1_FE <dbl [49,139]> <dbl> <dbl [49,139]> 3.53 5.56
## 3 26 <sf> reg2_FE <dbl [4,575]> <dbl> <dbl [4,575]> 2.84 4.08
## 4 27 <sf> reg2_FE <dbl [49,139]> <dbl> <dbl [49,139]> 3.38 5.50
## 5 26 <sf> reg3_FE <dbl [4,575]> <dbl> <dbl [4,575]> 2.78 3.78
## 6 27 <sf> reg3_FE <dbl [49,139]> <dbl> <dbl [49,139]> 3.34 5.20
It’s apparent that holiday time lags don’t seem to matter probably because most US holidays lie on the weekend. Now we can visualize error metrics to evaluate the accuracy of the different regressions and determine the best one.
week_predictions %>%
dplyr::select(week, Regression, MAE) %>%
gather(Variable, MAE, -Regression, -week) %>%
ggplot(aes(week, MAE)) +
geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
scale_fill_manual(values = palette5) +
labs(title = "Mean Absolute Errors by model specification and week") +
plotTheme
As previously stated, the third regression model (the lag model) is strongest in terms of its lower MAE, meaning it is the most accurate. For example, regression 3’s MAE ranges from between 2.5-3.5 minutes, meaning it predicts delay minutes with a mean absolute error above or below the observed value.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
from_cleaned = map(data, pull, from_cleaned)) %>%
dplyr::select(interval60, from_cleaned, Observed, Prediction, Regression) %>%
unnest(cols = c(interval60, from_cleaned, Observed, Prediction)) %>%
gather(Variable, Value, -Regression, -interval60, -from_cleaned) %>%
group_by(Regression, Variable, interval60) %>%
summarize(Value = sum(Value, na.rm = TRUE), .groups = 'drop') %>%
ggplot(aes(interval60, Value, colour = Variable)) +
geom_line(size = 1.1) +
scale_color_manual(values = c("#032863", "#BBC2DD")) +
facet_wrap(~Regression, ncol = 1) +
labs(
title = "Predicted/Observed Delay Minutes",
subtitle = "NJ Transit July-August 2018",
x = "Hour",
y = "Delay Minutes"
) +
plotTheme
Overall, our models seem to fit moderately with the observed values which indicates some accuracy. Notice the predictions in light blue tend to overlap with the observed values in dark blue well. Regression 3 seems to be performing the best and hence we will use it further. Next we look at mean absolute errors by station. From the graph below, there seems to be a spatial pattern to the errors. Specifically the central part of NJ has more mean absolute errors than any other part. Also, errors are not usually on the lower end which indicates moderate accuracy.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
from_cleaned = map(data, pull, from_cleaned),
lat1 = map(data, pull, LATITUDE),
lng1 = map(data, pull, LONGITUDE)) %>%
select(interval60, from_cleaned, lng1, lat1, Observed, Prediction, Regression) %>%
unnest() %>%
filter(Regression == "reg3_FE") %>%
group_by(from_cleaned, lng1, lat1) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = st_union(njstate), color = "grey", fill = "transparent")+
geom_point(aes(x = lng1, y = lat1, color = MAE),
fill = "transparent", alpha = 0.4)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "rocket")+
ylim(min(stationdelays$LATITUDE), max(stationdelays$LATITUDE))+
xlim(min(stationdelays$LONGITUDE), max(stationdelays$LONGITUDE))+
labs(title="Mean Absolute Error, Test Set, Model 3")+
mapTheme
Lastly, in the above map, we can see which stops/lines have the highest MAE values, thus are the least accurate. The darkest (highest MAE) value points appear mostly along the Raritan Valley Line and the North Jersey Coast Line, suggesting greater attention and monitoring be in place there, while the northern lines like the Pascack Valley Line, Gladstone Branch, and Main Line have lower MAE values, indicating our model performs better there.
Next we will plot observed vs. predicted for different times of day during the week and weekend. The plots of the weekends and weekdays show that our model is best at predicting delay times towards the beginning of the day, but observed values vary from predicted values far more for evening and overnight delays, particularly during the weekdays.
week_predictions %>%
mutate(
interval60 = map(data, pull, interval60),
from_cleaned = map(data, pull, from_cleaned),
lat1 = map(data, pull, LATITUDE),
lng1 = map(data, pull, LONGITUDE),
dotw = map(data, pull, dotw)
) %>%
select(
interval60, from_cleaned, lng1, lat1, Observed, Prediction, Regression, dotw
) %>%
unnest() %>%
dplyr::filter(Regression == "reg3_FE") %>%
mutate(
weekend = ifelse(dotw == "Sun" | dotw == "Sat", "Weekend", "Weekday"),
time_of_day = case_when(
hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"
)
) %>%
ggplot() +
geom_point(aes(x = Observed, y = Prediction), color = "#032863", alpha = 0.4) +
geom_smooth(aes(x = Observed, y = Prediction), method = "lm", se = FALSE, color = "#BBC2DD") +
geom_abline(slope = 1, intercept = 0) +
facet_grid(time_of_day ~ weekend) +
labs(
title = "Observed vs Predicted",
x = "Observed trips",
y = "Predicted trips"
) +
plotTheme
Now we will look at our errors on a map by weekend/weekday and time of day. In general, accuracy and MAE value appears somewhat line-dependent, meaning errors will occur along a particular line. If we look at the maps horizontally, across the time of day, MAE values become greater over the course of the day from day to night, as discussed in the previous plots as well. For weekdays in particular, it is the Raritan Valley line that becomes most increasingly inaccurate amongst the lines, while on weekends, it is most visibly the North Jersey Coast Line, Morristown Line, and Raritan Valley Line. The errors in the model for these lines suggest more data, feature engineering, etc. that may better account for these delays.
week_predictions %>%
mutate(
interval60 = map(data, pull, interval60),
from_cleaned = map(data, pull, from_cleaned),
lat1 = map(data, pull, LATITUDE),
lng1 = map(data, pull, LONGITUDE),
dotw = map(data, pull, dotw)
) %>%
select(
interval60, from_cleaned, lng1,
lat1, Observed, Prediction, Regression,
dotw
) %>%
unnest() %>%
filter(Regression == "reg3_FE") %>%
mutate(
weekend = ifelse(dotw == "Sun" | dotw == "Sat", "Weekend", "Weekday"),
time_of_day = case_when(
hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"
)
) %>%
group_by(from_cleaned, weekend, time_of_day, lng1, lat1) %>%
summarize(MAE = mean(abs(Observed - Prediction), na.rm = TRUE)) %>%
ggplot() +
geom_sf(data = st_union(njstate), color = "darkgrey", fill = "transparent") +
geom_point(
aes(x = lng1, y = lat1, color = MAE),
fill = "transparent", size = 0.7, alpha = 0.7
) +
scale_colour_viridis_c(direction = -1, option = "mako") +
ylim(min(stationdelays$LATITUDE), max(stationdelays$LATITUDE)) +
xlim(min(stationdelays$LONGITUDE), max(stationdelays$LONGITUDE)) +
facet_grid(weekend ~ time_of_day) +
labs(title = "Mean Absolute Errors, Test Set") +
mapTheme
The model saw 2 weeks of training data and 3 weeks of test data and was made with a variety of diverse and representation datasets. The k-fold Cross-validation test showed that the model was fairly generalizable because the RMSE values were on the low end.
The model does alright in terms of accuracy because it has high Mean absolute errors (more than 3 for the test months). As seen in 6.2, the model performs well in predicting for most days and does not predict well for outliers (like July 3rd in Section 6.2). There are also more errors in stations with higher delay minutes. This shows the fluctuating nature of delays and strengthens the call for repairs in those heavily delayed stations.
Our use case set out to identify delays specific to rail lines to schedule repair and check. By accurately predicting delayed minutes, we are able to spatially determine which stations have higher delays and thus helping NJ Transit decide how to be more proactive in selecting stations that could in the future experience more delays.
These are some ideas that could make the model perform stronger:
In conclusion, we believe our project is a proactive approach to track maintenance through the lens of delay consequences. It is cost reducing, and can increase resilience and customer satisfaction.
We recommend NJ transit to use our model to help predict and plan. Our government-facing website, is set to include information around, first, daily statistics of where and by how long trains are delayed, second, track health prediction for 2 months ahead: including where will tracks be under most stress from climatic hazards, and third, a record for NJT authorities to schedule checks and repairs.
Our wireframing is here
Watch our pitch here.