Image credits

1. Introduction and Use Case

Have you experienced severe train delays, especially on the New Jersey Transit system? What are the primary causes of this delay? Where does it happen more? Does climate impact delays? These are some questions we hope to answer while building this analysis and website. But most important of all, can delay prediction be used to maintain and upgrade tracks for optimum service? New Jersey Transit trains have been experiencing delays from as long as one can remember. NJ Transit is the primary transit authority in the state of New Jersey and has control over its rail systems and tracks. This project looks into the causes of delays and assumes that most of them might be attributed to climatic conditions. As climate change progresses rapidly, these delays have worsened and the project looks into how addressing and predicting for these delays can help select those tracks that need more maintenance and checks.

Use Case

We are creating a Government-facing website for NJ transit authority to predict where most train delays will happen to schedule checking, remodeling and upgradation of train tracks in those segments. We believe that prevention is better than cure. As of 2010, NJ Transit spends 67% of its budget on repairs. We hope our forecast models will allow NJ Transit to optimize repair funding more efficiently and prevent delays. This analysis uses a simple Machine Learning model based on a linear regression, making it both simple and easily replicable. We see great uses of a similar model in other cities like New York and Philadelphia to help predict for SEPTA and MTA delays to scale. This model is also approaching the most overlooked issue in rails: maintenance. While constructing new lines or upgrading trains might be well appreciated in the public eye, any transit employee will tell you that the most painstaking undertaking is maintenance. By assuming that a large share of delay occur because of over heating or cooling of tracks, wind speed, visibility issues or heating issues causing wire tangling, we predict for where these delays can happen so that proper checks and maintenance can be put into place.

#update.packages(repos = "https://cran.rstudio.com/")
#install.packages("stringdist")
#update.packages("kableExtra")
#detach("kableExtra:dpylr", unload = TRUE)
#install.packages("htmltools")
library(htmltools)
library(tidyverse)
library(sf)
library(lubridate)
library(httr)
library(tigris)
library(tidycensus)
library(car)
library(caret)
library(corrplot)
library(dplyr)
#library(kableExtra)
library(viridis)
library(riem)
library(gridExtra)
library(purrr)
library(stringdist)
library(knitr)
library(ggplot2)
library(RSocrata)
library(sf)
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
plotTheme <- theme(
  plot.title =element_text(size=12, family = "mono", face = "bold"),
  plot.subtitle = element_text(size=8, family = "mono"),
  plot.caption = element_text(size = 6, family = "mono"),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1, family = "mono"),
  axis.text.y = element_text(size = 10, family = "mono"),
  axis.title.y = element_text(size = 10, family = "mono"),
  axis.title.x = element_text(size = 10, family = "mono"),
  # Set the entire chart region to blank
  panel.background=element_rect(fill= "#F4EAE1"),
  plot.background=element_blank(),
  #panel.border=element_rect(colour="#F0F0F0"),
  # Format the grid
  panel.grid.major=element_line(colour="#CFC2B4",size=.2),
  axis.ticks=element_blank())


mapTheme <- theme(plot.title =element_text(size=12, family = "mono", face = "bold"),
                  plot.subtitle = element_text(size=7 , family = "mono"),
                  plot.caption = element_text(size = 5 , family = "mono"),
                  axis.line=element_blank(),
                  axis.text.x=element_blank(),
                  axis.text.y=element_blank(),
                  axis.ticks=element_blank(),
                  axis.title.x=element_blank(),
                  axis.title.y=element_blank(),
                  panel.background=element_rect(fill= "#F4EAE1"),
                  panel.border=element_blank(),
                  panel.grid.major=element_line(colour = 'transparent'),
                  panel.grid.minor=element_blank(),
                  legend.direction = "vertical", 
                  legend.position = "right",
                  legend.text = element_text(size=8, family = "mono"), 
                   legend.title = element_text(size = 9, family = "mono", face = "bold"),

                  plot.margin = margin(1, 1, 1, 1, 'cm'),
                  legend.key.height = unit(1, "cm"), legend.key.width = unit(0.3, "cm"))

palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette2 <- c("#6baed6","#08519c")

# census_api_key("YOUR KEY GOES HERE", overwrite = TRUE)

2. Data wrangling

2.1 Loading train delay data

We are looking at the late summer months (July and August) of 2018 because we hypothesize more train delays in those months. Our data comes form a Kaggle dataset and has information around what station each train is leaving from and going to, it’s scheduled time and actual time and its minutes of delay, its date, line and status. This is a pretty comprehensive data set for our analysis but it would be even better if it contained information about cause of delay.

dat2018_07 <- read.csv("data/2018_07.csv")
dat2018_08 <- read.csv("data/2018_08.csv")

dat <- rbind(dat2018_07, dat2018_08) %>%
  filter(from != to) %>% filter(type != "Amtrak")
njtrans_s <- st_read("https://services6.arcgis.com/M0t0HPE53pFK525U/arcgis/rest/services/NJTransit_Rail_Stations/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson")
njtrans_l <- st_read("https://services6.arcgis.com/M0t0HPE53pFK525U/arcgis/rest/services/NJTRANSIT_RAIL_LINES_1/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson") %>% st_transform(crs = st_crs(njtrans_s))
njstate <- st_read("https://services2.arcgis.com/XVOqAjTOJ5P6ngMu/arcgis/rest/services/NJ_Counties_3857/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson") %>% st_transform(crs = st_crs(njtrans_s))

2.2 About NJ Transit network

NJ transit is New Jersey’s Public Transportation Corporation. Covering a service area of 5,325 square miles, NJ Transit is the nation’s third largest provider of bus, rail and light rail transit, linking major points in New Jersey, New York and Philadelphia. The agency operates an active fleet of 1,231 trains and 93 light rail vehicles for a total of 12 rail lines statewide, it provides nearly 270 million passenger trips each year. Source

njtrans_l <- njtrans_l %>%
  mutate(COLOR = case_when(
    LINE_CODE == "AC" ~ "005DAA",
    LINE_CODE == "BC" ~ "B9C9DF",
    LINE_CODE == "ML" ~ "FFCF01",
    LINE_CODE == "ME" ~ "00A94F",
    LINE_CODE == "SL" ~ "B7A673",
    LINE_CODE == "NC" ~ "EF3E42",
    LINE_CODE == "MC" ~ "E66B5B",
    LINE_CODE == "PV" ~ "8e258d",
    LINE_CODE == "PRIN" ~ "EF3E42",
    LINE_CODE == "RV" ~ "FAA634",
    LINE_CODE == "GL" ~ "A2D5AE",
    LINE_CODE == "NE" ~ "EF3E42",
    LINE_CODE == "ST" ~ "f48aa4",
    TRUE ~ NA_character_
  ))
njtrans_l$COLOR <- as.character(njtrans_l$COLOR)
njtrans_l <- njtrans_l %>%
  mutate(Modified_COLOR = paste0("#", COLOR))

#plotting
ggplot() + 
  geom_sf(data=st_union(njstate), color= "#5D4954", fill= "#ffffff") +
  geom_sf(data = njtrans_l, aes(color = Modified_COLOR), size = 6) +
  scale_color_manual(values = njtrans_l$Modified_COLOR,
                     labels = njtrans_l$LINE_NAME) +
  geom_sf(data = njtrans_s, show.legend = "point", size = 0.3) +
  labs(title="NJ Transit Rail", 
       subtitle="New Jersey") +
  guides(color = guide_legend(title = "Rail Lines",
                              keywidth = 1,  # Adjust key width as needed
                              keyheight = 1,  ))+ 
   mapTheme

library(dplyr)
library(stringr)
library(sf)
dat$from <- tolower(dat$from)
njtrans_s$STATION_ID <- tolower(njtrans_s$STATION_ID)

dat <- dat %>%
  mutate(
    from_cleaned = as.character(from) %>%
      str_replace_all(" junction", "") %>%
      str_replace_all("mount", "mt") %>%
      str_replace_all(" line", "") %>%
      str_replace_all(" terminal", "") %>%
      str_replace_all("montclair state u", "montclair st univ") %>%
      str_replace_all("-montclair", "") %>%
      str_replace_all(" avenue", "") %>%
      str_replace_all("-williams ave", "") %>%
      str_replace_all("-hackensack", "") %>%
      str_replace_all(" fair lawn", "") %>%
      str_replace_all(" jct.", "") %>%
      str_replace_all(" rail terminal", "") %>%
      str_replace_all("atlantic city rail", "atlantic city") %>%
      str_replace_all(" lvl", " level") %>%
      str_replace_all("philadelphia", "30th street station") %>%
      str_replace_all(" transit center", "") %>%
      str_replace_all("williams ave", "") %>%
      str_replace_all(" view", "") %>%
      str_replace_all("secaucus concourse", "secaucus lower level") %>%
      str_replace_all("-", " "))

njtrans_s <- njtrans_s %>%
  mutate(
    from_cleaned = as.character(STATION_ID) %>%
      str_replace_all(" junction", "") %>%
      str_replace_all("mount", "mt") %>%
      str_replace_all(" line", "") %>%
      str_replace_all(" terminal", "") %>%
      str_replace_all("montclair state u", "montclair st univ") %>%
      str_replace_all("-montclair", "") %>%
      str_replace_all(" avenue", "") %>%
      str_replace_all("-williams ave", "") %>%
      str_replace_all("-hackensack", "") %>%
      str_replace_all(" fair lawn", "") %>%
      str_replace_all(" jct.", "") %>%
      str_replace_all(" rail terminal", "") %>%
      str_replace_all("atlantic city rail", "atlantic city") %>%
      str_replace_all(" lvl", " level") %>%
      str_replace_all("philadelphia", "30th street station") %>%
      str_replace_all(" transit center", "") %>%
      str_replace_all("-", " ") %>%
      str_replace_all("williams ave", "") %>%
      str_replace_all(" view", "")
  ) 

njtrans_s <- njtrans_s %>%
  mutate(
    from_cleaned = case_when(
      OBJECTID_1 == "102" ~ paste(from_cleaned, "main st", sep = " "),
      OBJECTID_1 == "156" ~ paste("ramsey route 17"),
      TRUE ~ from_cleaned
    )
  )

njtrans_s <- njtrans_s %>%
  mutate(
    from_cleaned = case_when(
      OBJECTID_1 == "149" ~ paste("middletown ny"),
      OBJECTID_1 == "12" ~ paste("middletown nj"),
      TRUE ~ from_cleaned
    )
  )

merged_data <- dat %>%
 left_join(select(njtrans_s, from_cleaned, LATITUDE, LONGITUDE, geometry), by = "from_cleaned") 

merged_data <- merged_data %>%
  mutate(
    delay = as.logical(case_when(
      delay_minutes > 0.5 ~ TRUE,
      delay_minutes <= 0.5 ~ FALSE
    ))
  )

merged_data <- st_sf(merged_data, crs = 4326)
merged_data <- merged_data %>%
  mutate(interval60 = floor_date(ymd_hms(scheduled_time), unit = "hour"), #datetime string
         interval15 = floor_date(ymd_hms(scheduled_time), unit = "15 mins"),
         week = week(interval60),
         dotw = wday(interval60, label=TRUE))
stationdelays <- merged_data %>%
  group_by(from_cleaned) %>%
  summarize(
    total_observations = n(),
    total_delays = sum(delay),
    delay_pct = total_delays / total_observations,
    total_delayed_minutes = sum(delay_minutes, na.rm = TRUE),
    mean_delayed_minutes = mean(delay_minutes, na.rm = TRUE),
    LATITUDE = first(LATITUDE),
    LONGITUDE = first(LONGITUDE),
    week = first(week),
    dotw = first(dotw))

3. Data exploration

3.1 Where do most delays occur?

From the maps it is clear that the maximum delay minutes are experienced at the hub of NJ: Newark. Delay minutes are clustered near the part of the state with most stations with fewer delays happening on the outskirts. Some stations closer to New York such as Newark station and Secaucus junction have total delayed minutes close to 5000 minutes in 2 months leading huge revenue loss. Interestingly Princeton and Princeton Junction also have close to 4000 minutes of delays in 60 days which could mean the loss of precious time for students, faculty and staff attending Princeton. This shows the dire need for our tool to understand pain points for NJ Transit and address them locally.

delayed_trips <- merged_data %>%
  filter(delay)

plot1 <- ggplot() +
  geom_sf(data = st_union(njstate), fill = "#F4EAE1") +
  stat_density_2d(
    data = as.data.frame(st_coordinates(delayed_trips)),
    aes(x = X, y = Y, fill = ..level..),
    geom = 'polygon',
    bins = 50, alpha = 0.3
  ) +
  scale_fill_gradient(low = "#BBC2DD", high = "#032863", name = "Density") +
  labs(title = "Density of Delayed Trips",  subtitle = "NJ Transit")+
  mapTheme

plot2 <- ggplot() +
  geom_sf(data = st_union(njstate), color = "#5D4954", fill = "#BBC2DD") +
  geom_sf(data = njtrans_l, color = "white", size = 1) +
  geom_sf(data = stationdelays, aes(color = total_delayed_minutes), size = 1) +
  scale_color_viridis_c(name = "Delayed Mins", option = "plasma") + 
  scale_size_continuous(name = "Size (points)", range = c(3, 10), labels = scales::number_format(unit = "pt")) +  # Adjust size range as needed
  labs(title = "Total Delayed Minutes per Station", 
       subtitle = "NJ Transit") +
  guides(color = guide_legend(
    title = "Minutes",
    keywidth = 1,  # Adjust key width as needed
    keyheight = 1, # Adjust key height as needed
    label.position = "bottom"  # Adjust label position
    )) +
  mapTheme

grid.arrange(plot1, plot2, ncol =2)

3.2 What stations are experiencing most delays?

Just within our 2-month time frame, delays occur across NJ by 10’s of thousands of minutes cumulatively, namely along the Northeast Corridor. According to our top ten histogram, Newark Penn Station in particular experiences over 2,000 hours of delays cumulatively over 2 months in 2018. There is an interesting difference in upper level trains and lower level trains at Secaucus, which might be because of higher maintenance and support needed for tracks on upper level. In terms of mean delayed minutes, Monmouth Park seems to have on an average 8 minutes of delays for late trains. Other stations are between 6-8 minutes each.

top_stations_total <- stationdelays %>%
  top_n(10, total_delayed_minutes)
top_stations_mean <- stationdelays %>%
  top_n(10, mean_delayed_minutes)

grid.arrange(
  ggplot(top_stations_total, aes(x = from_cleaned, y = total_delayed_minutes)) +
    geom_bar(stat = "identity", fill = "#032863", color = "#BBC2DD") +
    labs(title = "Top 10 Stations with Highest Total Delayed Minutes",
         x = "Station",
         y = "Total Delayed Minutes") + plotTheme,

  ggplot(top_stations_mean, aes(x = from_cleaned, y = mean_delayed_minutes)) +
    geom_bar(stat = "identity", fill = "#032863", color = "#BBC2DD") +
    labs(title = "Top 10 Stations with Highest Mean Delayed Minutes",
         x = "Station",
         y = "Total Delayed Minutes") + plotTheme,
  ncol = 1)

3.3 Do climatic conditions have an impact on delays?

The image shows that NJ Transit has to clear out leaves every Fall using AquaTrack:

NJ Transit Aqua Track

Source

Here is an example of train tracks facing extreme heat and bending:

Train Tracks buckling in extreme heat

We believe that this is the most important factor that influences train delays. The first four plots below are pulled from the University of Iowa weather repository based off different weather monitoring stations at Airports across the world. We derive the EWR (Newark airport) data since it acts as the main airport for the state. As visible, there was rainfall on some days of the week but most days did not see any precipitation. The wind speeds went upto a high of 30 miles/ hour but most days saw a mix of 0,10 and 20 miles/ hour. Temperature in the beginning of July was in the range of eighties but saw peaks going upto 100 degrees some days.Visibility was on an average at 10 but dropped on some days. On days with high precipitation, visibility is low.

The fifth plot looks at the overall time pattern which shows a daily periodicity and lower usage periods on weekends. This plot below shows that there were around 1500 trips taken daily. Again, lesser trips are taken on weekends, maybe lesser than 500, (seen in the periodic dip) and fairly the same amount of trips are taken on weekdays.

weather.Panel <- 
  riem_measures(station = "EWR", date_start = "2018-07-01", date_end = "2018-08-31") %>% 
  dplyr::select(valid, tmpf, p01i, sknt, vsby)%>%
  replace(is.na(.), 0)

weather.Panel <- weather.Panel %>%
    mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60, label=TRUE)) %>%
    group_by(interval60) %>%
    summarize(Temperature = max(tmpf),
              Precipitation = sum(p01i),
              Wind_Speed = max(sknt),
              Visibility = max(vsby)
                )%>%
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

glimpse(weather.Panel)

grid.arrange(
  ggplot(weather.Panel, aes(interval60,Precipitation)) +
  geom_area(col="#24775f", fill = "#BBC2DD") +
  geom_line() + 
  labs(title="Precipitation", x="Hour", y="Precipitation") + plotTheme,
  ggplot(weather.Panel, aes(interval60,Wind_Speed)) +
     geom_area(col="#24775f", fill = "#032863") +
    geom_line() + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
  ggplot(weather.Panel, aes(interval60,Temperature)) +
    geom_area(col="#24775f", fill = "#BF421B") +
    geom_line() + 
    labs(title="Temperature", x="Hour", y="Temperature") +
  plotTheme, 
  ggplot(weather.Panel, aes(interval60,Visibility)) +
    geom_area(col="#24775f", fill = "#BFAE9B") +
    geom_line() + 
    labs(title="Visibility", x="Hour", y="Visibility") +
  plotTheme, 
  ggplot(merged_data %>%
         group_by(interval60) %>%
         tally())+
  geom_line(aes(x = interval60, y = n))+
  labs(title="Trips",
       x="Date", 
       y="Number of trips")+
  plotTheme,
  ncol= 1, 
  top="Weather Data -  July to August, 2018") 

3.4 What is the distribution of delayed minutes like?

The above plot shows a histogram of how many trips were delayed. Some trips were delayed by even 2 hours. There were a few trips that we delayed by half an hour, an hour and even by 2 hours. Most trains were delayed between 0 to 5 minutes. This means that shorter delays are more common than longer delays which might have implications on how the website could be used.

ggplot(merged_data) +
  geom_histogram(aes(x = delay_minutes, y = ..count..), binwidth = 5, fill = "#032863", color = "#404040") +
  labs(title = "Distribution of Delayed Minutes",
       x = "Delayed Minutes",
       y = "Count of Delays") + plotTheme

3.5 What time do most NJ Transit trains run anyway?

The plots below shows the distribution of train trips across 24 hours. It seems that weekdays see a huge increase in morning trains between 7am-9am and evening rush between 4pm-6pm. On an average, weekdays have way more trains running. Weekends have a high number of trains close to midnight which may be a result of the nightlife in New Jersey and New York and die down between 1-7 am. Between 8am and 9pm, there is an average of 2000 trips. This could be beneficial information in even scheduling maintenance and checks at times that are less popular. It also highlights on what times most critical serviceability is needed: lesser delay minutes here could mean more revenue for NJ Transit which could mean more upgraded infrastructure.

grid.arrange(
  ggplot(merged_data %>% 
           mutate(hour = hour(actual_time))) +
    geom_freqpoly(aes(hour, color = dotw), binwidth = 1) + 
    labs(title = "Train schedule NJT, by day of the week, July-August, 2018",
         x = "Hour", 
         y = "Trip Counts") + 
    plotTheme,
  
  ggplot(merged_data %>% 
           mutate(hour = hour(actual_time),
                  weekend = ifelse(dotw == "Sun" | dotw == "Sat", "Weekend", "Weekday"))) +
    geom_freqpoly(aes(hour, color = weekend), binwidth = 1) +
    labs(title = "Train schedule NJT - weekend vs weekday, July-August, 2018",
         x = "Hour", 
         y = "Trip Counts") +
    plotTheme
)

4. Model building

For our use case, we decided to employ both spatial and time-series variables through appending a weather panel to our train trip panel. The resulting model data is a time-series data that looks at delay data and weather data in intervals of one hour, as well the existing spatial attributes such as station, stop, etc.

4.1 Create Panel

The following panel joins our train data and weather panel.

ride.panel <- 
  left_join(merged_data, weather.Panel,
              by = "interval60")

4.2 Create Time lags

Next, we will create new columns with time lags by an hour, 2 hours, 3 hours 4 hours and 12 hours in addition to a day. We will also create a new column called holidays that will evaluate if the day is a holiday from previous data and classify it as one.

ride.panel <- 
  ride.panel %>% 
  arrange(from_id, interval60) %>% 
  mutate(lagHour = dplyr::lag(delay_minutes,1),
         lag2Hours = dplyr::lag(delay_minutes,2),
         lag3Hours = dplyr::lag(delay_minutes,3),
         lag4Hours = dplyr::lag(delay_minutes,4),
         lag12Hours = dplyr::lag(delay_minutes,12),
         lag1day = dplyr::lag(delay_minutes,24),
         holiday = ifelse(yday(interval60) == 148,1,0)) %>%
   mutate(day = yday(interval60)) %>%
   mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
                                 dplyr::lag(holiday, 2) == 1 ~ "PlustTwoDays",
                                 dplyr::lag(holiday, 3) == 1 ~ "PlustThreeDays",
                                 dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
                                 dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
                                 dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
         holidayLag = ifelse(is.na(holidayLag) == TRUE, 0, holidayLag))

4.3 Correlation between weather and delay minutes

library(gridExtra)
library(ggplot2)
grid.arrange(
  ggplot(ride.panel) +
    geom_point(aes(x = Precipitation, y = delay_minutes), color = "#032863", alpha = 0.4) +
    geom_smooth(aes(x = Precipitation, y = delay_minutes), method = "lm", se = FALSE, color = "#BBC2DD") +
    labs(title = "Delay Minutes vs Precipitation",
         x = "Precipitation",
         y = "Delay Minutes") + plotTheme,

  ggplot(ride.panel) +
    geom_point(aes(x = Temperature, y = delay_minutes), color = "#032863", alpha = 0.4) +
    geom_smooth(aes(x = Temperature, y = delay_minutes), method = "lm", se = FALSE, color = "#BBC2DD") +
    labs(title = "Delay Minutes vs Temperature",
         x = "Temperature",
         y = "Delay Minutes") + plotTheme,

  ggplot(ride.panel) +
    geom_point(aes(x = Wind_Speed, y = delay_minutes), color = "#032863", alpha = 0.4) +
    geom_smooth(aes(x = Wind_Speed, y = delay_minutes), method = "lm", se = FALSE, color = "#BBC2DD") +
    labs(title = "Delay Minutes vs Wind Speed",
         x = "Wind_Speed",
         y = "Delay Minutes") + plotTheme,

  ggplot(ride.panel) +
    geom_point(aes(x = Visibility, y = delay_minutes), color = "#032863", alpha = 0.4) +
    geom_smooth(aes(x = Visibility, y = delay_minutes), method = "lm", se = FALSE, color = "#BBC2DD") +
    labs(title = "Delay Minutes vs Visibility",
         x = "Visibility",
         y = "Delay Minutes") + plotTheme,
  ncol = 2, 
  top = "Weather and Delay Minutes Correlation - July to August, 2018"
)

Now we will check to see if delays are affected by extreme weather conditions. From the graphs, it seems like some of the longest delays have occurred when it was not raining. As the precipitation amount increases, there are long delays. Towards the end of the graph, as amount of precipitation increases, there are fewer and shorter delay. This points to mild correlation.

In terms of Temperature, it seems like these months had an average of 70 degrees and high delays were experienced some days. Especially, very long delays were observed when the temperature of that day was at a higher end (80 or above). It seems that wind speed and visibility might not be very correlated to delays because both of them see steady delays across all wind speeds and visibility. We will try to run some models now, mixing some of these weather variables.

5. Run Models

To run different models, we first split our data into a training and a test set. We create three linear models using the lm function (linear models). The first three weeks of July (week 26-28) are used as training data and the last week of July and first week of August is used as the testing data. We create the models using our training data ride.Train. The first models include only temporal controls, but the later ones contain all of our lag information. We are using a partition that is user specified and is time dependent. We chose to split our data by time.

ride.Train <- dplyr::filter(ride.panel, week %in% c(28, 29, 30))
ride.Test <- dplyr::filter(ride.panel, week < 28)

5.1 Linear Regression Models

All our regression models contain independent variable as delay_minutes which is the number delayed minutes. Our first regression has dependent variables as interval60 which is each unique date and hour of the months of July and August, dotw or the day of the week and temperature. The second regression has dependent variables as from_cleaned which is the departure station id, day of the week and all our climatic variables like Temperature, Precipitation, Visibility and Wind speed. The third regression is similar to the second one with additional data related to lag (1,2, 3, 12 hours and one day) and lagholiday and holiday data.

reg1 <- 
  lm(delay_minutes ~  interval60 + dotw + Temperature,  data=ride.Train)

reg2 <- 
  lm(delay_minutes ~  from_cleaned + interval60 + dotw + Temperature + Precipitation + Visibility + Wind_Speed, 
     data=ride.Train)

reg3 <- 
  lm(delay_minutes ~  from_cleaned + interval60 + dotw + Temperature + Precipitation + Visibility + Wind_Speed + 
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday, 
     data=ride.Train)

5.2 K-fold cross validation

To test how well the models do, we will use a k-fold cross validation where we assign cvIDs (cross validation IDs) randomly to all unique rows. We will run this cross validation over all five weeks (that we had previously split into train and test) to see how generalizable and accurate our models are. We will show the top 5 samples from running all k-fold validations and see how well the models perform across all five models. We will split data based on time, meaning each fold represents data from a different time period. This is because we have seen that our data exhibits temporal patterns, and it helps assess how well the model generalizes to unseen time periods.

fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)
ride.Train <- st_set_geometry(ride.Train, NULL)

reg.cv1 <- 
  train(delay_minutes ~ ., data = ride.Train %>% 
          dplyr::select(delay_minutes, 
                        interval60, dotw , Temperature), 
        method = "lm", trControl = fitControl, na.action = na.pass)

reg.cv2 <- 
  train(delay_minutes ~ ., data = ride.Train %>% 
          dplyr::select(delay_minutes, 
                        from_cleaned ,interval60, dotw, Temperature, Precipitation, Visibility, Wind_Speed), 
        method = "lm", trControl = fitControl, na.action = na.pass)

reg.cv3 <- 
  train(delay_minutes ~ ., data = ride.Train %>% 
          dplyr::select(delay_minutes, 
                        from_cleaned , interval60, dotw, Temperature, Precipitation, 
                        lagHour, lag2Hours, lag3Hours, lag12Hours, lag1day, holiday, Visibility, Wind_Speed), 
        method = "lm", trControl = fitControl, na.action = na.pass)

extract_results <- function(reg_cv, model_name) {
  results <- as.data.frame(reg_cv$resample)
  results$model <- model_name
  # Filter for the first 5 folds
  results <- results[1:5, ]
  return(results)
}

all_results <- bind_rows(
  extract_results(reg.cv1, "reg.cv1"),
  extract_results(reg.cv2, "reg.cv2"),
  extract_results(reg.cv3, "reg.cv3")
)
library(kableExtra)
kable(all_results, caption = "Regression Model Cross-Validation Results", align = "c") %>%
  kable_styling("striped", full_width = FALSE)
Regression Model Cross-Validation Results
RMSE Rsquared MAE Resample model
5.325470 0.0022141 3.168298 Fold001 reg.cv1
5.465386 0.0051035 3.241763 Fold002 reg.cv1
5.476234 0.0134466 3.227853 Fold003 reg.cv1
5.350452 0.0069206 3.174769 Fold004 reg.cv1
5.029503 0.0092997 3.127089 Fold005 reg.cv1
5.549391 0.0234141 3.222613 Fold001 reg.cv2
4.659120 0.0447820 2.976860 Fold002 reg.cv2
4.767482 0.0845974 2.971308 Fold003 reg.cv2
4.573005 0.0676700 2.938677 Fold004 reg.cv2
5.711818 0.0464883 3.294688 Fold005 reg.cv2
5.907981 0.0504834 3.268331 Fold001 reg.cv3
5.860619 0.0349745 3.111318 Fold002 reg.cv3
5.633685 0.0613958 3.215898 Fold003 reg.cv3
5.165269 0.0599809 3.126408 Fold004 reg.cv3
4.737500 0.0768399 2.927418 Fold005 reg.cv3

Now we will look at the first five folds for each of our regressions across Root Mean Squared Error(RMSE), R-squared, and Mean absolute error(MAE).

RMSE (Root Mean Squared Error): This metric measures the average magnitude of the errors between predicted and actual values. For the best model, we are looking at lower RMSE values and for the model to generalize well, we are looking for consistency across folds. Regression 1 has high RMSE values which indicates poor model performance but they are fairly consistent. Similarly, reg2 and reg 3 have similar RMSE values but they fluctuate quite a bit.

Rsquared: This metric measures the proportion of the variance in the dependent variable that is predictable from the independent variables. A higher R-squared value suggests that a larger proportion of the variance in the dependent variable is explained by the model. Higher R-squared values are desirable. We are looking at higher r-squared values and more consistency. regression 1,2 and 3 have very low r-squared values that are lower than 10% dependability.

MAE (Mean Absolute Error): This metric represents the average absolute difference between predicted and actual values. Like RMSE, lower MAE values indicate better model performance. The MAEs decrease as we go down the table. This means that reg1 has the highest MAE which is still not too much while reg3 has the least mean absolute error.

6. Predict for test data

Now we create a nested dataframe and a function called model_pred that will be used to test data.

ride.Test.weekNest <- 
  ride.Test %>%
  nest(data = -week) 
model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)}

6.1 Predictions

ride.Test.weekNest is a nested data frame which has data grouped into subsets, and each subset (or “nest”) contains a data frame.week_predictions, is a data frame where each row represents a combination of observed and predicted values along with corresponding metrics.

week_predictions <- ride.Test.weekNest %>% 
  mutate(
    reg1_FE = map(.x = data, fit = reg1, .f = model_pred),
    reg2_FE = map(.x = data, fit = reg2, .f = model_pred),
    reg3_FE = map(.x = data, fit = reg3, .f = model_pred)
  ) %>%
  gather(Regression, Prediction, -data, -week) %>%
  mutate(
    Observed = map(data, pull, delay_minutes),
    Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
    MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
    sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE)
  )

week_predictions
## # A tibble: 6 × 8
##    week data   Regression Prediction     Observed Absolute_Error   MAE sd_AE
##   <dbl> <list> <chr>      <list>         <list>   <list>         <dbl> <dbl>
## 1    26 <sf>   reg1_FE    <dbl [4,575]>  <dbl>    <dbl [4,575]>   3.18  4.12
## 2    27 <sf>   reg1_FE    <dbl [49,139]> <dbl>    <dbl [49,139]>  3.53  5.56
## 3    26 <sf>   reg2_FE    <dbl [4,575]>  <dbl>    <dbl [4,575]>   2.84  4.08
## 4    27 <sf>   reg2_FE    <dbl [49,139]> <dbl>    <dbl [49,139]>  3.38  5.50
## 5    26 <sf>   reg3_FE    <dbl [4,575]>  <dbl>    <dbl [4,575]>   2.78  3.78
## 6    27 <sf>   reg3_FE    <dbl [49,139]> <dbl>    <dbl [49,139]>  3.34  5.20

6.2 Examining Error Metrics for Accuracy

It’s apparent that holiday time lags don’t seem to matter probably because most US holidays lie on the weekend. Now we can visualize error metrics to evaluate the accuracy of the different regressions and determine the best one.

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette5) +
    labs(title = "Mean Absolute Errors by model specification and week") +
  plotTheme

As previously stated, the third regression model (the lag model) is strongest in terms of its lower MAE, meaning it is the most accurate. For example, regression 3’s MAE ranges from between 2.5-3.5 minutes, meaning it predicts delay minutes with a mean absolute error above or below the observed value.

week_predictions %>%
  mutate(interval60 = map(data, pull, interval60),
         from_cleaned = map(data, pull, from_cleaned)) %>%
  dplyr::select(interval60, from_cleaned, Observed, Prediction, Regression) %>%
  unnest(cols = c(interval60, from_cleaned, Observed, Prediction)) %>%
  gather(Variable, Value, -Regression, -interval60, -from_cleaned) %>%
  group_by(Regression, Variable, interval60) %>%
  summarize(Value = sum(Value, na.rm = TRUE), .groups = 'drop') %>%
  ggplot(aes(interval60, Value, colour = Variable)) + 
  geom_line(size = 1.1) + 
  scale_color_manual(values = c("#032863", "#BBC2DD")) +
  facet_wrap(~Regression, ncol = 1) +
  labs(
    title = "Predicted/Observed Delay Minutes",
    subtitle = "NJ Transit July-August 2018",
    x = "Hour",
    y = "Delay Minutes"
  ) +
  plotTheme

Overall, our models seem to fit moderately with the observed values which indicates some accuracy. Notice the predictions in light blue tend to overlap with the observed values in dark blue well. Regression 3 seems to be performing the best and hence we will use it further. Next we look at mean absolute errors by station. From the graph below, there seems to be a spatial pattern to the errors. Specifically the central part of NJ has more mean absolute errors than any other part. Also, errors are not usually on the lower end which indicates moderate accuracy.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           from_cleaned = map(data, pull, from_cleaned), 
           lat1 = map(data, pull, LATITUDE), 
           lng1 = map(data, pull, LONGITUDE)) %>%
    select(interval60, from_cleaned, lng1, lat1, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "reg3_FE") %>%
  group_by(from_cleaned, lng1, lat1) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
  geom_sf(data = st_union(njstate), color = "grey", fill = "transparent")+
  geom_point(aes(x = lng1, y = lat1, color = MAE), 
             fill = "transparent", alpha = 0.4)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "rocket")+
  ylim(min(stationdelays$LATITUDE), max(stationdelays$LATITUDE))+
  xlim(min(stationdelays$LONGITUDE), max(stationdelays$LONGITUDE))+
  labs(title="Mean Absolute Error, Test Set, Model 3")+
  mapTheme

Lastly, in the above map, we can see which stops/lines have the highest MAE values, thus are the least accurate. The darkest (highest MAE) value points appear mostly along the Raritan Valley Line and the North Jersey Coast Line, suggesting greater attention and monitoring be in place there, while the northern lines like the Pascack Valley Line, Gladstone Branch, and Main Line have lower MAE values, indicating our model performs better there.

6.3 Space-Time Error Evaluation

Next we will plot observed vs. predicted for different times of day during the week and weekend. The plots of the weekends and weekdays show that our model is best at predicting delay times towards the beginning of the day, but observed values vary from predicted values far more for evening and overnight delays, particularly during the weekdays.

week_predictions %>% 
  mutate(
    interval60 = map(data, pull, interval60),
    from_cleaned = map(data, pull, from_cleaned), 
    lat1 = map(data, pull, LATITUDE), 
    lng1 = map(data, pull, LONGITUDE),
    dotw = map(data, pull, dotw)
  ) %>%
  select(
    interval60, from_cleaned, lng1, lat1, Observed, Prediction, Regression, dotw
  ) %>%
  unnest() %>%
  dplyr::filter(Regression == "reg3_FE") %>%
  mutate(
    weekend = ifelse(dotw == "Sun" | dotw == "Sat", "Weekend", "Weekday"),
    time_of_day = case_when(
      hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
      hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
      hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
      hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"
    )
  ) %>%
  ggplot() +
  geom_point(aes(x = Observed, y = Prediction), color = "#032863", alpha = 0.4) +
  geom_smooth(aes(x = Observed, y = Prediction), method = "lm", se = FALSE, color = "#BBC2DD") +
  geom_abline(slope = 1, intercept = 0) +
  facet_grid(time_of_day ~ weekend) +
  labs(
    title = "Observed vs Predicted",
    x = "Observed trips", 
    y = "Predicted trips"
  ) +
  plotTheme

Now we will look at our errors on a map by weekend/weekday and time of day. In general, accuracy and MAE value appears somewhat line-dependent, meaning errors will occur along a particular line. If we look at the maps horizontally, across the time of day, MAE values become greater over the course of the day from day to night, as discussed in the previous plots as well. For weekdays in particular, it is the Raritan Valley line that becomes most increasingly inaccurate amongst the lines, while on weekends, it is most visibly the North Jersey Coast Line, Morristown Line, and Raritan Valley Line. The errors in the model for these lines suggest more data, feature engineering, etc. that may better account for these delays.

week_predictions %>%
  mutate(
    interval60 = map(data, pull, interval60),
    from_cleaned = map(data, pull, from_cleaned),
    lat1 = map(data, pull, LATITUDE),
    lng1 = map(data, pull, LONGITUDE),
    dotw = map(data, pull, dotw)
  ) %>%
  select(
    interval60, from_cleaned, lng1,
    lat1, Observed, Prediction, Regression,
    dotw
  ) %>%
  unnest() %>%
  filter(Regression == "reg3_FE") %>%
  mutate(
    weekend = ifelse(dotw == "Sun" | dotw == "Sat", "Weekend", "Weekday"),
    time_of_day = case_when(
      hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
      hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
      hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
      hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"
    )
  ) %>%
  group_by(from_cleaned, weekend, time_of_day, lng1, lat1) %>%
  summarize(MAE = mean(abs(Observed - Prediction), na.rm = TRUE)) %>%
  ggplot() +
  geom_sf(data = st_union(njstate), color = "darkgrey", fill = "transparent") +
  geom_point(
    aes(x = lng1, y = lat1, color = MAE),
    fill = "transparent", size = 0.7, alpha = 0.7
  ) +
  scale_colour_viridis_c(direction = -1, option = "mako") +
  ylim(min(stationdelays$LATITUDE), max(stationdelays$LATITUDE)) +
  xlim(min(stationdelays$LONGITUDE), max(stationdelays$LONGITUDE)) +
  facet_grid(weekend ~ time_of_day) +
  labs(title = "Mean Absolute Errors, Test Set") +
  mapTheme

7. Results and discussion

7.1.1 Generalizability

The model saw 2 weeks of training data and 3 weeks of test data and was made with a variety of diverse and representation datasets. The k-fold Cross-validation test showed that the model was fairly generalizable because the RMSE values were on the low end.

7.1.2 Accuracy

The model does alright in terms of accuracy because it has high Mean absolute errors (more than 3 for the test months). As seen in 6.2, the model performs well in predicting for most days and does not predict well for outliers (like July 3rd in Section 6.2). There are also more errors in stations with higher delay minutes. This shows the fluctuating nature of delays and strengthens the call for repairs in those heavily delayed stations.

7.2 How does the model fit with use case?

Our use case set out to identify delays specific to rail lines to schedule repair and check. By accurately predicting delayed minutes, we are able to spatially determine which stations have higher delays and thus helping NJ Transit decide how to be more proactive in selecting stations that could in the future experience more delays.

7.3 How could the analysis be bettered?

These are some ideas that could make the model perform stronger:

  1. Additional Variables:
    1. Urban Heat Island Effect: One could take the percent of built surface around stations in 500m buffer of the station to account for how the urban heat island effect is affecting tracks.
    2. Tree Coverage: Area covered by trees in 500m buffer of the station. This could show obstacles on tracks like in the Fall, NJ Transit spends revenue on clearance of leaves from tracks and during summer, trees could help reduce temperature of tracks.
  2. Iterations of more regression models to make best prediction. We need to use spacial fixed effect and some more testing of models to come up with smaller Mean errors and higher R squared.

In conclusion, we believe our project is a proactive approach to track maintenance through the lens of delay consequences. It is cost reducing, and can increase resilience and customer satisfaction.

8. Final thoughts and Wireframe of website

We recommend NJ transit to use our model to help predict and plan. Our government-facing website, is set to include information around, first, daily statistics of where and by how long trains are delayed, second, track health prediction for 2 months ahead: including where will tracks be under most stress from climatic hazards, and third, a record for NJT authorities to schedule checks and repairs.

Our wireframing is here

Watch our pitch here.