ENGINEERING A BETTER COMMUTE

crowded train platform

\[\\[0.25in]\]

If you have ever felt like NJ Transit needed work, you are not wrong. In January 2019, stations across the New Jersey transit system experienced the equivalent of about 1 year 8 months worth of delays.

But what if you could know your train was running late before it was running late?

At eNJine, we are making this possible.

No more rushing, only to wait. No more missing your connection. No arriving late to appointments. At eNJine, our data scientists are dedicated to engineering a better commute for you. Our beta testing has focused on proving the feasibility of using machine learning to improve commuting. Currently, eNJine can predict with 93 percent accuracy whether trains will be departing within two minutes of their scheduled departure times using information available two hours ahead. This is proof that it is possible to engineer a better commute.

Our Data

The beauty of our model is that it is simple. It can be easily replicated not only for other NJ Transit lines, but other transit systems worldwide. The model relies solely on data from NJ Transit service and it is the kind of data that transit systems likely already collect. From there, we engineered a few features that are powerful indicators of lateness. For example, the animations above shows how lateness varies not only at each stop, but also at certain times of day.

eNJine’s model relies on publicly available data about NJ Transit’s scheduled arrivals and departures (including train line, stop sequence, train id) as well as information about these trains’ actual arrival and departure times. Our data scientists have also incorporated publicly available weather data as well as coordinates of NJ Transit stations.

Our model works because we understand how New Jersey transit works.

The engineers at eNJine have spent weeks working to better understand what makes NJ Transit’s commuter rails run late. Below, you can see how train delays vary across days of the week and stops. With these and other insights, we dial into exactly what causes New Jersey trains to be late.

time lapse of train lateness, Gladstone Branch

time lapse of train lateness, Bergen Co. Line

Beyond this, our researchers have investigated how various categorical variables influence how lateness is distributed across different categories. Lateness varies across lines: the Bergen Co. Line is somewhat more likely than the Gladstone Branch Line to run late. Similarly, the plot of variation in lateness across every day of the week likewise shows that trains tend to be late more or less often on different days of the week and that in general trains run with fewer delays on weekends.

catvars<-panelnoGEOM%>% dplyr::select(y2, #from,
                             line,
                             weekday)%>%
  gather(Variable, value, -y2)%>%
  count(Variable, value, y2)%>%
  ggplot(., aes(value, n, fill=y2))+
  geom_bar(position="dodge", stat = "identity")+
  facet_wrap(~Variable, scales="free")+
  scale_fill_manual(values=palette2)+
  labs(x="y", y="Value",
       title = "Feature Associations with the Likelihood\nof a Train Late More Than 2 Minutes",
       subtitle = "All Catagorical Feats",
       caption= "Fig.2a")+
  theme(legend.position = "bottom", axis.text.x=element_text(angle=45, hjust=1))+plotTheme2()

catvars

Similarly, our engineers have studied how the distribution of trains being at least two minutes late (or not) corresponds to the values of numeric variables.

Significant differences in these distributions are apparent at various values. This is particularly clear in the plot for hour as a numeric variable: commuter trains are much more likely to run on time before seven am than afterward (n.b., trains are also more likely to run on time later in the day as well: future versions of the model might also include a categorical variable to further dial in on this variation). Likewise, trains are much less likely to depart late when there are fewer departures per hour. Finally, there are several plots representing a temporal lag of lateness in the transit system as well as plots of the square root of those lags; while these plots look similar, the plots of the roots show clearer variation, and these features will ultimately be more useful in developing eNJine’s model.

numvars<- panelnoGEOM%>%dplyr::select(y2, total_departures,
                                        hour,
                                        onestoplag, 
                                        onestopearlieron,
                                        twostopearlieron,
                                        sqrtlag3Hours,
                                        lag3Hourstemp) %>%
  gather(Variable, value, -y2) %>%
  ggplot() + 
  geom_density(aes(value, color=y2), fill = "transparent") + 
  facet_wrap(~Variable, scales = "free") +
  scale_color_manual(values = palette2) +
  labs(title = "Feature Distributions with the Likelihood\nof a Train Late More Than 2 Minutes",
       subtitle = "All Numeric Outcomes",
       caption= "Fig.2b") +
  theme(legend.position = "bottom")+plotTheme2()

numvars

Our model accounts for lateness across the system.

Our engineers created features for the model to represent how behind schedule trains on each line were running at the same time during the week prior (lag1Week) and the day before (lag1Day) as well as earlier that day (lag2hours, lag3hours, etc.). These relationships are not linear, and so the square roots of these spatial lags are incorporated into the model.

Our engineers also designed a feature to represent lateness at nearby stops earlier in the day (onestopearlieron, twostopearlieron). Their approach to engineering this feature relied on information about individual departures and delays. Having this data allowed our engineers to create spatial lags before aggregating these delays to the interval hour and performing a temporal lag. This provides more specific information about how the system is running than the simple temporal lag and can likely act as a stand in for information about track malfunctions, etc.

Our Approach

Our engineers experimented with a number of approaches before arriving at our final beta model. This model relies on a logistic regression to make binary predictions about whether trains, running on a particular line, within “interval hours” will be more than two minutes late. eNJine can do all this using only data available two hours prior to the scheduled departure time.

Our engineers experimented with building separate models for different lines; ultimately, however, they determined it was possible to achieve similar results by continuing to operate with one model that includes lines as a categorical variable. In the future, our engineers may find it useful to return to this line-specific method: however, the two lines included in our proof-of-concept work operate similarly enough that this is not necessary.

In order to determine what numeric variables would be most useful for the model, our engineers relied on a correlation plot that demonstrates the relationship between various features.

numericVars <- panelnoGEOM%>%dplyr::select(all_of(Model1.vars), y2_numeric)%>%
  select_if(is.numeric) %>% na.omit()

corrplot <- ggcorrplot( 
  round(cor(numericVars), 1), 
  p.mat = cor_pmat(numericVars),
  colors = c("#260a03", "white", "#a81010"),
  type="lower",
  insig = "blank",
  lab = TRUE) +  
    labs(title = "Correlation Across Numeric Variables",
         caption = "Fig.3")+plotTheme2()

corrplot

This plot allowed our researchers to determine not only which variables were correlated with lateness, but also give them insight into the interactions between variables. This proved especially useful in choosing which variables to include as our engineers were limited in the computing power available to them and so built a model using fewer features than they might have otherwise.

To determine what categorical features to include, our researchers relied on their initial exploratory analysis, included above.

With more investment and computing power, we will be able to improve the accuracy of our model by increasing the specificity of inputs: i.e., building a model that takes the direction a train is travelling into account as well as iterating over shorter time intervals.

Our model works

By cross-validating, our engineers were able to determine that the beta-model currently correctly predicts whether a train will be late 93 percent of the time. We think that this is pretty good for a beta model. It is, however, worth noting that currently our model is better at predicting when a train is going to be on time than when it is going to be late.

ModelcvFit1.plot<- dplyr::select(ModelcvFit1$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(ModelcvFit1$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
  geom_histogram(bins=35, fill = "#93afdc") +
  facet_wrap(~metric) +
  geom_vline(aes(xintercept = mean), colour = "#a81010", linetype = 3, size = 1.5) +
  scale_x_continuous(limits = c(0, 1)) +
  labs(x="Goodness of Fit", y="Count", title="Model 1 Goodness of Fit Metrics",
       subtitle = "Across-fold mean reprented as dotted lines",
       caption= "Fig.4") +
  plotTheme2()

ModelcvFit1.plot

Of course, for eNJine to be useful to commuters, it is imperative not only that our engineers have the resources to minimize errors, but also that we balance their directionality when they do occur. If we fail to do this well, eNJine won’t be useful to commuters. Even minor errors can cause problems and result in commuters missing their trains (if we say a train will be late and it isn’t) or their connections (if we say a train won’t be late and it is).

There are various ways to envision the trade offs of a logistic model: below we have included both a distribution of probabilities as well as an ROC curve.

 model1plot<- ggplot(ModelProbs1, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
   geom_vline(xintercept = 0.25,  
                color = "#93afdc", size=2, alpha=0.5)+
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = palette2) +
  labs(x = "Late Over 2 Mins", y = "Density of probabilities",
       title = "Model 1: Distribution of Predicted\nProbabilitiesby Actual Outcome",
       subtitle = "Threshold: 0.25", 
       caption= "Fig.5") +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "none")+plotTheme2()


Modelroc.1<- ggplot(ModelProbs1, aes(d = as.numeric(Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#a81010") +
     geom_vline(xintercept = 0.25,  
                color = "#93afdc", size=2, alpha=0.5)+
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve - Model 1", subtitle = paste("AUC:", round(Modelauc1[1], digits=3)) )+plotTheme2()


plot_grid(model1plot, Modelroc.1, align = "h")

In useful ways, these plots demonstrate the trade offs that eNJine’s data scientists are forced to make to turn a probabilistic measure of a train being at least two minutes late into a useful binary answer for commuters. Below, we illustrate our errors in our predictions using a 0.25 threshold.

thresh <- ModelProbs1.thresholds%>%filter(Threshold == .25)%>%dplyr::select(contains("Rate"), Accuracy)%>%gather(Variable, Value)

resultsbar <- thresh%>%
    ggplot(aes(Variable, Value), fill = "#260a03",  colour= "white") +
      geom_bar( position = "dodge", stat = "identity", fill = "#260a03") +
      labs(title="Model Results",
           caption = "Fig.6", 
           x = "Outcome",y = "Rate") +
      plotTheme2() + theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "bottom") 

resultsbar

Currently, our model is likely to generate false negatives, predicting a train will be late when it is not, than it is to generate false positives predicting a train will be on-time, when it will really be late. At eNJine we believe that tradeoff is worthwhile. As a result, commuters may arrive on time only to have to wait when we have told them they will not have to, but we believe that is better than having the errors in the other direction—which would result in them missing their train entirely.

Our engineers are always making our model better.

For this proof-of-concept report, our scope was limited to just two lines on New Jersey Transit: the Bergen Co. Line and the Gladstone Branch. Obviously, this tool is only helpful if you are a commuter on one of those two lines, so in future iterations of this project it would be wise to include all the lines.

In terms of powerful features, we were surprised by how little weather impacted the models. It would be useful to identify other external factors that impact train delays. Our engineers are interested in including public health data in future versions of the model, as they believe this may act as a useful proxy for worker absenteeism (likewise: drinking holidays). Future models may also be improved by incorporating information about specific models of trains as well as planned maintenance and service disruptions.

Additionally, our current product incorporates a few different lag features, one of which is based on whole hour intervals. We recognize that that may not be the best approach in engineering a time-based lag for New Jersey Transit, but we simply lack the computing power to model shorter intervals. Similarly, we were unable to incorporate a significant route direction feature, NJ train interaction with freight lines, and employee attendance.

Michael Scott struggles to get on a moving freight train

Don’t be this guy. Use eNJine.

\[\\[1in]\]

Code Library

# Libraries -----------------
library(tidyr)
library(dplyr)
library(lubridate)
library(sf)
library(tidycensus)
library(tidyverse)
library(tigris)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(caret)
library(knitr) 
library(pscl)
library(plotROC)
library(gtsummary) #for table cause broom always f's everything up
library(pROC)
library(tibble)
library(cowplot)
## FOR MAPS ~~
library(RColorBrewer)
library(ggmap)  
library(maps)
library(mapdata)
library(ggcorrplot)



options(tigris_class = "sf")
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")


palette6 <- c("#260a03", "#a81010", "#f8cd12", "#f2fdff", "#93afdc", "#0c0e1d" )

palette5 <- c("#260a03", "#a81010", "#f8cd12", "#f2fdff", "#93afdc", "#0c0e1d" )

palette4 <- c("#a81010", "#f8cd12", "#93afdc", "#0c0e1d" )

palette3 <- c("#a81010", "#f8cd12", "#93afdc")

palette2 <- c("#260a03", "#a81010")

palette2b <- c( "#260a03", "#f8cd12")

plotTheme2 <-function(base_size = 12, title_size = 18) {
  theme(
    panel.background =element_blank(), 
    text = element_text( color = "#0c0e1d", family="Helvetica"),
    plot.title = element_text(size = title_size, colour = "#0c0e1d", family="Helvetica", face="bold"), 
    plot.subtitle = element_text(face="italic", family="Helvetica", colour = "#0c0e1d"),
    plot.caption = element_text(hjust=0, colour = "#0c0e1d", family="Helvetica"),

    axis.ticks.y = element_line(color = "grey80", size = 0.1),
    plot.background =element_blank(),
    
    panel.grid.major.x   = element_line("grey80", size =0.1 ),
    panel.grid.major.y   =element_line("grey80", size = 0.05),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "#93afdc", fill=NA, size=4),
    #panel.spacing = unit(c(2,2,2,2), "cm"),
    strip.background = element_rect(fill = "#93afdc", color = "#93afdc"),
    strip.text = element_text(size=12,color ="#0c0e1d", family="Helvetica"),
    axis.title = element_text(size=12, color ="#0c0e1d", family="Helvetica",  face="italic"),
    axis.text = element_text(size=10, color ="#0c0e1d", family="Helvetica",),
    
    legend.background = element_blank(),
    legend.title = element_text(colour = "#0c0e1d", face = "italic", family="Helvetica"),
    legend.text = element_text(colour = "#0c0e1d", face = "italic", family="Helvetica"),
    strip.text.x = element_text(size = 14, family="Helvetica", colour = "#0c0e1d")
  )
}


plotTheme3 <-function(base_size = 12, title_size = 18) {
  theme(
    panel.background=element_rect(fill = "#93afdc", color = "#f2fdff"), 
    text = element_text( color = "#0c0e1d", family="Helvetica"),
    plot.title = element_text(size = title_size, colour = "#0c0e1d", family="Helvetica", face="bold"), 
    plot.subtitle = element_text(face="italic", family="Helvetica", colour = "#0c0e1d"),
    plot.caption = element_text(hjust=0, colour = "#0c0e1d", family="Helvetica"),

    axis.ticks.y = element_line(color = "grey80", size = 0.1),
    plot.background =element_blank(),
    
    panel.grid.major.x   = element_line("grey80", size =0.1 ),
    panel.grid.major.y   =element_line("grey80", size = 0.05),
    panel.grid.minor = element_blank(),
    panel.border =element_blank(),
    #panel.spacing = unit(c(2,2,2,2), "cm"),
    strip.background = element_rect(fill = "#f2fdff", color = "#f2fdff"),
    strip.text = element_text(size=12,color ="#0c0e1d", family="Helvetica"),
    axis.title = element_text(size=12, color ="#0c0e1d", family="Helvetica",  face="italic"),
    axis.text = element_text(size=10, color ="#0c0e1d", family="Helvetica",),
    
    legend.background = element_blank(),
    legend.title = element_text(colour = "#0c0e1d", face = "italic", family="Helvetica"),
    legend.text = element_text(colour = "#0c0e1d", face = "italic", family="Helvetica"),
    strip.text.x = element_text(size = 14, family="Helvetica", colour = "#f2fdff")
  )
}



mapTheme2<-function(base_size = 12, title_size = 16) {
  theme(
    text = element_text( color = "#0c0e1d", family="Helvetica"),
    plot.title = element_text(size = title_size,colour = "#0c0e1d", family="Helvetica"),
    plot.subtitle=element_text(face="italic", family="Helvetica"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "#93afdc", fill=NA, size=2),
    strip.text.x = element_text(size = 14, color="#0c0e1d", family="Helvetica"),
    strip.background  = element_rect(fill="#93afdc", color="#93afdc"))
}













# RawData -----------
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
jan2019 <- read.csv("https://raw.githubusercontent.com/bri-ne/MUSA508-Final/main/data/2019_01.csv?token=AVOS24NFX2PEGNUZQUPRNY3BXYPQY")
feb2019 <- read.csv("https://raw.githubusercontent.com/bri-ne/MUSA508-Final/main/data/2019_02.csv?token=AVOS24PUUJDM2PW36N5T6K3BXYPQ4")
mar2019 <- read.csv("https://raw.githubusercontent.com/bri-ne/MUSA508-Final/main/data/2019_03.csv?token=AVOS24MRII4RKEMYGNWWUSDBXYPQ6")
weather.Data <- 
  riem_measures(station = "EWR", date_start = "2019-01-01", date_end = "2019-05-01")





# Warngledate ----
njtransitdf <- rbind(jan2019, feb2019, mar2019)


njtransitdf <- njtransitdf %>% 
  mutate(scheduled_time_dt = as_datetime(scheduled_time),
         actual_time_dt = as_datetime(actual_time),
         delay_dt = seconds(as.integer(delay_minutes*60)), 
         intervalhour = floor_date(scheduled_time_dt, unit= "hour"),
         week = week(intervalhour)) %>%
  na.omit(delay_dt)







# Stations ----

stoplocations <- read.csv("https://raw.githubusercontent.com/bri-ne/MUSA508-Final/main/Rail_Stations_of_NJ_Transit.csv?token=AVSVPEVU7X6RXZP3RNOK2CLBW7VZO")

stoplocations <- stoplocations %>%rename(X = "ï..X")%>%st_as_sf(coords = c("X", "Y"), crs =3424, agr = "constant") 

#stoplocations <- stoplocations %>%st_as_sf(coords = c("X", "Y"), crs =3424, agr = "constant") 







# Harmonizing Station Name for Geoms for Maps  ----

  ## First, merge with stop data:
  

station_id <- unique(stoplocations$STATION_ID)

  ### Getting the names the same 
      
      ## GLAD ~~~~

train.weather.panel.lags.final$from<-replace(train.weather.panel.lags.final$from, train.weather.panel.lags.final$from == "Hoboken", "Hoboken Terminal")
train.weather.panel.lags.final$to<-replace(train.weather.panel.lags.final$to, train.weather.panel.lags.final$to == "Hoboken", "Hoboken Terminal")

train.weather.panel.lags.final$from<-replace(train.weather.panel.lags.final$from, train.weather.panel.lags.final$from == "Secaucus Upper Lvl", "Secaucus Junction Lower Level")
train.weather.panel.lags.final$to<-replace(train.weather.panel.lags.final$to, train.weather.panel.lags.final$to == "Secaucus Upper Lvl", "Secaucus Junction Lower Level")

stoplocations <- stoplocations%>%mutate(STATION_ID = ifelse(COUNTY == "Orange, NY" & STATION_ID == "Middletown", "Middletown NY", STATION_ID))

train.weather.panel.lags.final$from<-replace(train.weather.panel.lags.final$from, train.weather.panel.lags.final$from == "Broadway Fair Lawn", "Broadway")
train.weather.panel.lags.final$from<-replace(train.weather.panel.lags.final$from, train.weather.panel.lags.final$from == "Glen Rock Boro Hall", "Glen Rock-Boro Hall")
train.weather.panel.lags.final$from<-replace(train.weather.panel.lags.final$from, train.weather.panel.lags.final$from == "Hoboken", "Hoboken Terminal")
train.weather.panel.lags.final$from<-replace(train.weather.panel.lags.final$from, train.weather.panel.lags.final$from == "Radburn Fair Lawn", "Radburn")
train.weather.panel.lags.final$from<-replace(train.weather.panel.lags.final$from, train.weather.panel.lags.final$from == "Ramsey Main St", "Ramsey")
train.weather.panel.lags.final$from<-replace(train.weather.panel.lags.final$from, train.weather.panel.lags.final$from == "Ramsey Route 17", "Rte 17 Ramsey")
train.weather.panel.lags.final$from<-replace(train.weather.panel.lags.final$from, train.weather.panel.lags.final$from == "Secaucus Lower Lvl", "Secaucus Junction Lower Level")


train.weather.panel.lags.final$to<-replace(train.weather.panel.lags.final$to, train.weather.panel.lags.final$to == "Broadway Fair Lawn", "Broadway")
train.weather.panel.lags.final$to<-replace(train.weather.panel.lags.final$to, train.weather.panel.lags.final$to == "Glen Rock Boro Hall", "Glen Rock-Boro Hall")
train.weather.panel.lags.final$to<-replace(train.weather.panel.lags.final$to, train.weather.panel.lags.final$to == "Hoboken", "Hoboken Terminal")
train.weather.panel.lags.final$to<-replace(train.weather.panel.lags.final$to, train.weather.panel.lags.final$to == "Radburn Fair Lawn", "Radburn")
train.weather.panel.lags.final$to<-replace(train.weather.panel.lags.final$to, train.weather.panel.lags.final$to == "Ramsey Main St", "Ramsey")
train.weather.panel.lags.final$to<-replace(train.weather.panel.lags.final$to, train.weather.panel.lags.final$to == "Ramsey Route 17", "Rte 17 Ramsey")
train.weather.panel.lags.final$to<-replace(train.weather.panel.lags.final$to, train.weather.panel.lags.final$to == "Secaucus Lower Lvl", "Secaucus Junction Lower Level")





# StudyPanel ----

train.template <- 
  filter(njtransitdf, week %in% c(1: 10)) %>%
  filter(line == "Bergen Co. Line " | line == "Gladstone Branch")%>%
  mutate(train_counter=1)%>%
  group_by(line, intervalhour, from) %>%
  summarize(train_count = sum(train_counter, na.rm=T),
            late = mean(delay_dt))

study.panel <- 
  expand.grid(intervalhour = unique(train.template$intervalhour), 
              from = unique(train.template$from),
              line= unique(train.template$line))





# TrainPanel ----


train.panel <- train.template %>%
  right_join(study.panel)





# Weather  ----

weather.Panel <-  
  weather.Data %>%
    mutate_if(is.character, list(~replace(as.character(.), is.na(.), "0"))) %>% 
    replace(is.na(.), 0) %>%
    mutate(intervalhour = ymd_h(substr(valid, 1, 13))) %>%
    mutate(dotw = wday(intervalhour, label=TRUE)) %>%
    group_by(intervalhour) %>%
    summarize(Temperature = max(tmpf),
              Percipitation = sum(p01i),
              Wind_Speed = max(sknt)) %>%
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature)) 

train.weather.panel <- merge(x=train.panel, y= weather.Panel, by= 'intervalhour', all.x= T)


train.weather.panel[c("late","train_count")][is.na(train.weather.panel[c("late","train_count")])] <- 0 






# Building Time Lags ----


lag.df <- train.weather.panel %>%
  group_by(line, intervalhour) %>%
  summarise(avg_late = mean(late),
            total_departures = sum(train_count),
            Temperature = mean(Temperature), 
            Percipitation = mean(Percipitation),
            Wind_Speed = mean(Wind_Speed)) %>%
  mutate(lag2Hours = dplyr::lag(avg_late,2),
         lag3Hours = dplyr::lag(avg_late,3),
         lag4Hours = dplyr::lag(avg_late,4),
         lag12Hours = dplyr::lag(avg_late,12),
         lag1day = dplyr::lag(avg_late,24),
         lag1week = dplyr::lag(avg_late, 168), 
         sqrtlag2Hours = sqrt(lag2Hours),
         sqrtlag3Hours = sqrt(lag3Hours),
         sqrtlag4Hours = sqrt(lag4Hours),
         sqrtlag12Hours = sqrt(lag12Hours),
         sqrtlag1day = sqrt(lag1day),
         sqrtlag1week = sqrt(lag1week),
         sqrtlag2hourstotal_departures = sqrt(dplyr::lag(total_departures)),
         lag2Hourswind = dplyr::lag(Wind_Speed,2),
         lag3Hourswind = dplyr::lag(Wind_Speed,3),
         lag4Hourswind = dplyr::lag(Wind_Speed,4),
         lag12Hourswind = dplyr::lag(Wind_Speed,12),
         lag3Hourstemp = dplyr::lag(Temperature,3),
        lag3HoursPercipitation = dplyr::lag(Percipitation,3))

train.weather.panel.lags = merge(train.weather.panel, lag.df, by=c("line", "intervalhour"))

train.weather.panel.lags[c("lag2Hours","lag3Hours","lag4Hours", "lag12Hours", "lag1day", "lag1week",
                           "sqrtlag2Hours", "sqrtlag3Hours", "sqrtlag4Hours", "sqrtlag12Hours", "sqrtlag1day", "sqrtlag1week", "sqrtlag2hourstotal_departures", "lag2Hourswind", "lag3Hourswind", "lag4Hourswind", "lag12Hourswind", "lag3Hourstemp", "lag3HoursPercipitation")][is.na(train.weather.panel.lags[c("lag2Hours","lag3Hours","lag4Hours", "lag12Hours", "lag1day", "lag1week", "sqrtlag2Hours", "sqrtlag3Hours", "sqrtlag4Hours", "sqrtlag12Hours", "sqrtlag1day", "sqrtlag1week", "sqrtlag2hourstotal_departures", "lag2Hourswind", "lag3Hourswind", "lag4Hourswind", "lag12Hourswind", "lag3Hourstemp", "lag3HoursPercipitation")])] <- 0 





# Space Lags ----

stoplag<- njtransitdf %>%
 filter(week %in% c(1: 10)) %>%
 filter(line == "Bergen Co. Line " | line == "Gladstone Branch")%>%
 group_by(line, train_id)%>%arrange(stop_sequence)%>%
  mutate(onestoplag = dplyr::lag(delay_minutes),
         twostoplag = dplyr::lag(delay_minutes,2)) %>%
  ungroup() %>%
  group_by(line, from, intervalhour) %>%
  summarise(onestoplag = mean(onestoplag), twostoplag=mean(twostoplag)) %>%
  mutate(onestopearlieron = dplyr::lag(onestoplag,1), twostopearlieron = dplyr::lag(twostoplag,2))%>%
  dplyr::select(intervalhour, line, from, onestoplag, twostoplag, onestopearlieron, twostopearlieron) 

stoplag[c("onestoplag", "twostoplag", "onestopearlieron", "twostopearlieron")][is.na(stoplag[c("onestoplag", "twostoplag", "onestopearlieron", "twostopearlieron")])] <- 0

train.weather.panel.lags.final <- left_join(train.weather.panel.lags, stoplag, by=c("line", "intervalhour", "from"))






# Final Setup ----

panelnoGEOM <- train.weather.panel.lags.final %>% 
  mutate(y2_numeric = ifelse(late < minutes(2), 0, 1),
         y2 = ifelse(late < minutes(2), "no", "yes"),
         lateM = late/60,
         hour= hour(intervalhour),
         daynumeric= wday(intervalhour),
         week = week(intervalhour),
         weekday = weekdays(intervalhour),
         y5_numeric = ifelse(late < minutes(5), 0, 1),
         y5 = ifelse(late < minutes(5), "no", "yes"),
         week_numeric = as.numeric(week))

panelnoGEOM <- panelnoGEOM %>%
  mutate(STATION_ID = from) 

panelGEOM <- left_join(panelnoGEOM, stoplocations) %>%
  st_as_sf(crs =3424, agr = "constant")



panelnoGEOM[c("onestoplag", "twostoplag", "onestopearlieron", "twostopearlieron")][is.na(panelnoGEOM[c("onestoplag", "twostoplag", "onestopearlieron", "twostopearlieron")])] <- 0


set.seed(2121)
panelnoGEOMTrain <- filter(panelnoGEOM, week < 9) #### BRI, YOU MUST CHANGE THIS BACK TO A 9 
panelnoGEOMTest <- filter(panelnoGEOM, week >= 9)









## Model ----

Model1<- glm(y2_numeric ~ .,
                        data=panelnoGEOMTrain %>% 
                          dplyr::select(y2_numeric,
                                        total_departures,
                                        #from,
                                        hour,
                                        line,
                                        #lag1week,
                                        weekday,
                                        #lag1day,
                                        #toPenn.binary, 
                                        #tuesWed.binary, 
                                        #friWed.binary,
                                        #toSigStop,
                                        onestopearlieron,
                                        twostopearlieron,
                                        onestoplag,
                                        #sqrtlag2Hours,
                                        sqrtlag3Hours,
                                        #sqrtlag4Hours,
                                        #sqrtlag1week,
                                        lag3Hourstemp),  
                        family="binomial" (link="logit"))









## Cross Validation ----

ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)

##### Model 1-


Model1.vars <- colnames(panelnoGEOMTrain %>% 
                          dplyr::select(total_departures,
                                        #from,
                                        hour,
                                        line,
                                        #lag1week,
                                        weekday,
                                        #lag1day,
                                        #toPenn.binary, 
                                        #tuesWed.binary, 
                                        #friWed.binary,
                                        #toSigStop,
                                        onestoplag,
                                        onestopearlieron,
                                        twostopearlieron,
                                        #sqrtlag2Hours,
                                        sqrtlag3Hours,
                                        #sqrtlag4Hours,
                                        #sqrtlag1week,
                                        lag3Hourstemp))

ModelcvFit1 <- caret::train(y2 ~ .,
                        data=panelnoGEOMTrain %>% 
                          dplyr::select(all_of(Model1.vars), y2), #na.action = na.pass,
                        method="glm", family="binomial",
                        metric="ROC", trControl = ctrl)






## Predictions ----


ModelProbs1  <- data.frame(Outcome = as.factor(panelnoGEOMTest$y2_numeric),
                         Probs = predict(Model1, panelnoGEOMTest, type="response"))


ModelProbs1.thresholds <- 
  iterateThresholds(data=ModelProbs1, observedClass = Outcome, 
                    predictedProbs = Probs)

ModelProbs1.thresholds<-ModelProbs1.thresholds%>%arrange(Rate_TP,Rate_TN) ## .30

modeltestProbs1 <- 
  ModelProbs1 %>%
  mutate(predOutcome  = as.factor(ifelse(ModelProbs1$Probs > .25 , 1, 0)))


Modelauc1<-pROC::auc(ModelProbs1$Outcome, ModelProbs1$Probs) #0.9501

Cypress Marrs, Briana Cervantes

12/3/2021