My plan is this:

  • Find a source of data about when and where accidents occurred on US 101
  • Find a source of data about traffic volume on the same road
  • Subset the data to include only accidents that occurred on US 101 between San Francisco and Palo Alto.
  • Divide accidents by traffic volume over the smallest time intervals for which I can get traffic volume data. For example, traffic volume per hour would be great, because then I can divide the average number of accidents in that hour on a given day by the volume of traffic in that window, and then assume, for lack of a better idea, that every car has an equal chance of being involved. Maybe I can get some data about different risk levels by driver age or type of car, but I imagine the insurance companies hold that data and aren't likely to share.
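
To make the last step concrete, here is a rough sketch of the per-hour calculation. The function name, the input layout, and all the numbers are made up for illustration (I don't have real data yet), and it bakes in the equal-chance-per-car assumption from above.

```python
from collections import Counter


def hourly_accident_rate(accident_hours, hourly_volume, n_days):
    """Estimate accidents per vehicle for each hour of the day.

    accident_hours: hour of day (0-23) of every accident recorded over n_days.
    hourly_volume:  dict mapping hour -> average vehicles passing in that hour
                    on a single day (hypothetical input format).
    n_days:         number of days covered by the accident records.
    """
    counts = Counter(accident_hours)
    rates = {}
    for hour, volume in hourly_volume.items():
        if volume <= 0:
            continue  # no exposure data for this hour, so no rate
        mean_accidents_per_day = counts.get(hour, 0) / n_days
        # Assumes every vehicle in the window is equally likely to be involved.
        rates[hour] = mean_accidents_per_day / volume
    return rates


# Made-up numbers, purely illustrative.
accident_hours = [8, 8, 8, 17, 17, 2]
hourly_volume = {2: 500, 8: 6000, 17: 7500}
rates = hourly_accident_rate(accident_hours, hourly_volume, n_days=365)
for hour in sorted(rates):
    print(f"{hour:02d}:00  {rates[hour]:.2e} accidents per vehicle")
```

Hours with no volume data are skipped rather than reported as zero risk, since a missing denominator says nothing about the rate.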

Suggestions for sources of data are greatly appreciated. Even if I have to do something lousy like multiplying national risk per capita by traffic volume, that would be good enough for now; my main problem is getting the data. (FYI: this is just for personal interest.)

William Gunn

1 Answer

I think a very important covariate would be the weather conditions. Weather data for the region of travel should be easily available to correlate with the occurrence of accidents. Traffic volume and time of day are probably highly related. But time of day may also be connected to when the more accident-prone drivers are likely to be on the road (teenage and elderly drivers, for example). The presence of accident-prone drivers on the road is not a covariate that you can collect, but time of day may be a useful surrogate for it. Keep in mind that for other important covariates you cannot collect directly, there might be a useful surrogate covariate to use in the model.

  • Thanks. The FARS website is getting me some distance towards where I want to go, and they do collect a range of data about each crash, not including weather conditions, but it's limited to only crashes where there was a fatality. – William Gunn Jul 07 '12 at 19:28
  • (+1) Weather is definitely related (think of the difference in the number of accidents in winter versus summer). So are the condition of the road and the speed limit (in addition to the total amount of traffic, whose relationship to the total number of accidents is likely non-linear). I'd check with the state police to see if they have tallies of accidents for insurance claims (or with the insurance companies themselves). Non-fatal accidents are a mixed bag, though; I was surprised how many fender-benders occur in parking lots in some of the data I have worked with. – Andy W Jul 09 '12 at 01:21
  • @Andy: Summer-winter weather changes are probably not a big effect on this particular stretch of road. :) I would think construction might be another potentially big factor. As for parking-lot fender-benders, this doesn't surprise me as much. People tend to back out of parking spaces using (only) their rear-view mirrors. – cardinal Jul 09 '12 at 02:50
  • 101 is a US highway. You don't enter it from parking lots. – Michael R. Chernick Jul 09 '12 at 03:13
  • @cardinal and Michael good point, I was thinking out loud a little too quickly. The only other thing I would note is that it is easy to sit here and list off a dozen variables that, based on personal experience, are likely to be related to the outcome. That doesn't take into consideration the time and effort needed to collect such variables. It is easy to make an impossible-to-fill list, as many of these data sources aren't readily available (and even if they are, a rigorous data analysis would likely take considerable time and effort). – Andy W Jul 09 '12 at 11:59
  • @AndyW Yes, that is why I was recommending looking for surrogate variables that you can collect data on. – Michael R. Chernick Jul 09 '12 at 12:50
  • When I last looked at detailed traffic data (I recall it was annual summaries by road segment throughout the state of Washington in the 1990's) I performed extensive testing for evidence of deviations from Poisson distributions, but failed to find anything significant. Strong influences of covariates ought to create non-Poisson, overdispersed data. The absence of such behavior suggests it may be rewarding to start any analysis with the data that are at hand, before going off and hunting down speculative covariates. – whuber Jul 09 '12 at 14:22
  • @whuber Interesting! The only time I worked with such data, I profiled hot spot locations for accidents for the state police in a rural upstate New York county. I did not even look to see whether those hot spot locations deviated from what might be expected under a Poisson distribution. I guess if I had to do it over again I might look into techniques for Ripley's K adjusted for street networks (although with only one road of interest that may be unnecessary). A related "trouble" is that geocoding locations on highways is difficult. – Andy W Jul 09 '12 at 17:07
  • @Andy: Such data should be available for large swaths of highways in the US from the US FHWA. Every few years, they collect survey data including GPS on many thousands of miles of highway. I wonder why they keep thinking the roads might change location. ;) – cardinal Jul 09 '12 at 20:25
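
As a rough illustration of the overdispersion check whuber describes above, one simple option is the classical index-of-dispersion test. This is a minimal sketch, not necessarily the testing he performed: it assumes the counts come from units with comparable exposure (for example, the same road segment over equal-length periods), and the function name and the counts are made up.

```python
from statistics import mean, variance


def dispersion_test(counts):
    """Index-of-dispersion check for Poisson counts.

    Returns (ratio, statistic, df), where ratio = sample variance / mean
    (close to 1 under a Poisson model, well above 1 when overdispersed)
    and statistic = (n - 1) * ratio, which is approximately chi-square
    with df = n - 1 under the Poisson hypothesis.
    """
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two counts")
    m = mean(counts)
    if m == 0:
        raise ValueError("mean count is zero; the ratio is undefined")
    ratio = variance(counts) / m  # sample variance uses the n - 1 denominator
    return ratio, (n - 1) * ratio, n - 1


# Made-up accident counts for eight equal-length periods.
counts = [3, 5, 4, 6, 2, 4, 5, 3]
ratio, statistic, df = dispersion_test(counts)
print(f"variance/mean = {ratio:.3f}, statistic = {statistic:.2f}, df = {df}")
```

A statistic far in the upper tail of the chi-square distribution with `df` degrees of freedom would point to overdispersion, and hence to covariates worth hunting down; a value near `df` would be consistent with starting from the data at hand.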