Uber NYC Trip Data from April to Sept. 2014 (github.com/fivethirtyeight)
73 points by brbcoding on Aug 28, 2015 | 26 comments



What Social Good and Smart City projects could you produce with a dump of all of Uber's trip data in a given city?

I think it's fair to say we (Uber) didn't intend for this data to be made public, but it was produced under a Freedom of Information Law (FOIL) Request, and now it is out.

So I'm curious how we at Uber could help improve cities and society with anonymized dumps of this kind of data - if this was a formalized thing.

Let me know your thoughts here or directly (benm at the domain of uber)


fivethirtyeight mentions one obvious usage: providing it to city planners for, say, bus routing.

The other thing I think you could do that would be very useful for drivers is show them areas where there are typically long waits or longer rides right now. I know a lot of drivers do some planning, but they seem to be in the dark. I'm sure a 'where to Uber' app on Android / iOS that tells drivers when and where to go, so they can catch surge pricing and customers who want to be taken to a desirable spot, would be VERY popular among drivers.


City planners already know the bus routes that would best serve the city. Unfortunately, there's no incentive to improve service or increase ridership since it already doesn't scale: for every 30 cents that bus riders contribute in fares, taxpayers put in one dollar collectively.
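As a quick sanity check on that ratio, here is a minimal sketch of the fare share it implies. It uses only the 30-cent and one-dollar figures quoted above; the function name is made up for illustration.

```python
def fare_share(fare_cents: float, subsidy_cents: float) -> float:
    """Fraction of combined fare-plus-taxpayer funding that comes from fares."""
    return fare_cents / (fare_cents + subsidy_cents)

# 30 cents of fares for every 100 cents of taxpayer money (figures from the comment)
share = fare_share(30, 100)
print(f"Riders cover about {share:.0%} of the combined funding")
```

In other words, if the quoted figures are right, fares would account for a bit under a quarter of the total.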


City planners do not necessarily know the best bus routes. They know which buses are used more frequently, and if they have unified fare systems using tap on/tap off cards, they know which trips are most often taken by bus.

What they do not know is where new bus services might be useful. Uber could fill that gap if it were able to show that there is demand for transport between X and Y that buses do not currently serve (e.g. late-night riders are there but the buses aren't, or there isn't even a bus that serves the route between A and B).

Data from services like taxis and Uber would be invaluable for planning mass transit systems.
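A rough sketch of that idea, assuming you had origin/destination zones per trip and a list of zone pairs already covered by a bus. Both are hypothetical inputs (the released Uber data has pickup locations only), and the zone labels and the `unserved_pairs` helper are invented for illustration.

```python
from collections import Counter

def unserved_pairs(trips, bus_pairs, min_trips=2):
    """Return (origin, destination) pairs with at least min_trips rides
    and no bus route, busiest first."""
    counts = Counter((origin, dest) for origin, dest in trips)
    candidates = [
        (pair, n) for pair, n in counts.items()
        if n >= min_trips and pair not in bus_pairs
    ]
    # Sort by ride count descending, then by pair for a stable order
    return sorted(candidates, key=lambda item: (-item[1], item[0]))

# Made-up zone labels, purely for illustration
trips = [("X", "Y"), ("X", "Y"), ("X", "Y"), ("A", "B"), ("A", "B"), ("X", "Z")]
bus_pairs = {("X", "Z")}  # pairs assumed to have bus service already
print(unserved_pairs(trips, bus_pairs))
```

A real analysis would also need time-of-day buckets to catch the late-night case, where a route exists but isn't running.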


I just want to shout out to Remix (FKA Transitmix) from the YC W15 batch - http://getremix.com/

Some wicked-smart former Code-For-America Fellows working on this YC-backed project.


To expand on the bus route thing: City of Houston just finished a complete revamp of their routes. It would be interesting to see how much, if at all, that impacted Uber's use in areas that should now have better (more direct) service.

Second item is DUI rates before and after Uber comes to town. For that matter, how much surge pricing happens during major events such as concerts, football games, etc., where people may perceive they must drive? Combine that with congestion data and perhaps Uber could propose a partnership with the local metro to pick people up outside of the deepest traffic area by doing a bus/rail hop first.

For instance Houston has light rail from our sporting venues to park and ride lots. If you can group people going to the same neighborhood it's possible to shift congestion and make the last mile cheaper for riders and more environmentally friendly to boot.

You also have major events where really no one should be driving, but people do anyway, so data is gathered: hurricanes, ice storms, blizzards and so on. In Houston we have regular flooding, and it would be interesting to see where reroutes happen due to deep water. It would also be very useful to have drivers marking flooded spots so they can be shared back to other drivers via Waze/Google Maps, and maybe even CoH fire and police.

I could easily see an autonomous car on the side of the road with an LED bar in its back window or on its roof warning that water is 12" deep past this point. Many of the worst spots have machine-vision-readable flood gauges, but even for those that don't it may be possible to gather surface elevation from the Google Maps car, since they have such accurate GPS and that data doesn't change often.

In fact one could see Uber / Google subleasing a few of those cars to HPD just for the purpose of sending them out to gather data (including video) on flooded intersections so police could spend their attention elsewhere. And of course so they'll know if someone does flood themselves out anyway by driving past the car and then stalling. Autonomous cars already track object distance and speed of vehicles in their view.


It doesn't apply so much to the big cities but in smaller cities and towns, it could be helpful for infrastructure planning/improvements - street lights, drinking fountains, bus stops around common pickup and drop off locations. Parking spaces around common UberPool locations, etc.


One cause of traffic is TLC cars roaming the streets slowly looking for pickups. If Uber can show that their cars don't wander as much, that may show that Uber is helping traffic. But the current data set doesn't have that info; only your internal data would have it.


What would the dump provide that is not already found in the dataset from NYC?


I'm suggesting, hypothetically, that such a dump would be an ongoing release of data rather than a one-time, ad hoc dump.

Hypothetically there may be more facets of data not provided to cities, etc.


Having only Lat/Long of pickup data is disappointing (would have liked the dropoff data too), but the date range can be correlated to the NYC Taxi data set, which exists for all of 2014.

That's enough for interesting visualizations and statistical analysis of comparisons between the two, although the original 538 article (http://fivethirtyeight.com/features/uber-is-serving-new-york... ) is pretty good. (for my own visualizations of the NYC Taxi dataset, see my blog post: http://minimaxir.com/2015/08/nyc-map/ )

The Aggregate_FHV_Data.xlsx contains data on Lyft as well. In September 2014, Lyft did 115,999 total pickups in NYC, while Uber did 1,028,136 pickups. (however, Lyft didn't have any activity in NYC until the end of July.)
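To make the gap concrete, a small calculation using only the two September 2014 totals quoted above (variable names are just for illustration):

```python
uber_pickups = 1_028_136  # September 2014, as quoted in the comment
lyft_pickups = 115_999    # September 2014, as quoted in the comment

ratio = uber_pickups / lyft_pickups
uber_share = uber_pickups / (uber_pickups + lyft_pickups)
print(f"Uber did {ratio:.1f}x Lyft's pickups ({uber_share:.1%} of the two combined)")
```

Keep in mind the caveat above: Lyft had only been active in NYC for a couple of months at that point.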


I love your visualizations of the NYC Taxi dataset, minimaxir =]

I was just going to mention that if people were interested in exploring this type of data, there exists a Kaggle competition[1] for "Taxi Trajectory Prediction". The data contains the full GPS paths of the taxis and comes with pre-existing scripts you can run online, including a Python script for visualizing the city via taxi paths[2].

There's even a secondary challenge for predicting travel times, which includes visualizations of taxi travel speed, producing "veins" on the city streets proportional to the speed you can travel along them[3].

[1]: https://www.kaggle.com/c/pkdd-15-predict-taxi-service-trajec...

[2]: https://www.kaggle.com/mcwitt/pkdd-15-predict-taxi-service-t...

[3]: https://www.kaggle.com/wikunia/pkdd-15-taxi-trip-time-predic...


Why are you just a software QA engineer with your skills?


I'm not minimaxir but anyway... that's an odd way to put it. My guess is, he/she is not the kind of software QA engineer you might be accustomed to depending on the kinds of companies you've worked at. Think of automated custom static and dynamic code analysis, continuous integration, build systems, deployment monitoring, etc. Places like Google and Stripe (and many other technical organizations with good engineering culture) have very skilled people doing these things.

I'm a normal/infrastructure engineer who reluctantly ends up focusing on test systems every once in a while. I get all the tests running, from a single script, with all results collated and machine readable, get CI set up properly, make the system reliable enough to trust, etc. Because someone has to do it! And yes I've done this in organizations where there was a QA engineer or two, who did reasonably cheap semi-automated QA, but couldn't put together a whole system and make it reliable enough to run on its own. But there are definitely a few places where QA engineer is not the "lowest rank".


General comment about QA: A lot of people do QA simply because they find the 'story' or the 'business case' interesting. It is not a starter-developer job.


Let's just say I'm a really good Software QA Engineer. :P


What. the. actual. fuck.

Trips outside of NYC (aka the 'burbs) have the full street address listed. Those are addresses of people's single family homes, vs. the NYC addresses which tend to be multi-family dwellings.


On mobile so I can't open it at the moment. Does it show names too? If not, what's the problem with knowing which addresses took cabs?


It's an issue of privacy. The public shouldn't know who does or does not take cabs.


Simple interactive timechart to see the number of rides over time per company: http://www.jut.io/play#gist/henridf/403a7b2e5d52a979eb28

Uber had a pretty sharp uptick in early September (end of summer?) which Lyft didn't appear to have.
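For anyone who wants to reproduce that kind of chart locally, here is a minimal sketch of the counting step behind it. The `company` and `pickup_time` column names and the timestamp format are assumptions for this example, not necessarily what the released files use, and the sample rows are made up.

```python
import csv
import io
from collections import Counter
from datetime import datetime

def rides_per_day(rows):
    """Count rides per (company, calendar day)."""
    counts = Counter()
    for row in rows:
        # Assumed timestamp format, e.g. "9/1/2014 0:11:00"
        day = datetime.strptime(row["pickup_time"], "%m/%d/%Y %H:%M:%S").date()
        counts[(row["company"], day)] += 1
    return counts

# Tiny made-up sample standing in for the real files
sample = io.StringIO(
    "company,pickup_time\n"
    "uber,9/1/2014 0:11:00\n"
    "uber,9/1/2014 8:30:00\n"
    "lyft,9/1/2014 9:05:00\n"
    "uber,9/2/2014 7:45:00\n"
)
counts = rides_per_day(csv.DictReader(sample))
for (company, day), n in sorted(counts.items()):
    print(company, day.isoformat(), n)
```

Feeding the resulting counts into any plotting tool gives one line per company over time.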


Mmm... there are some large areas that have no pickups in April: http://i.imgur.com/FRnH4zM.jpg



Using Google Fusion Tables can be a little misleading for this since you can't accurately illustrate density.
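One common workaround is to bin the points into a grid and color by count instead of plotting raw markers. A minimal sketch, assuming (lat, lon) pairs and an arbitrary 0.01-degree cell size; the sample coordinates are made up:

```python
import math
from collections import Counter

def grid_density(points, cell_deg=0.01):
    """Count points per grid cell; cells are cell_deg degrees on a side."""
    counts = Counter()
    for lat, lon in points:
        # Floor so that negative longitudes bin consistently
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        counts[cell] += 1
    return counts

# Made-up pickup coordinates for illustration
points = [
    (40.7128, -74.0060),
    (40.7131, -74.0052),
    (40.7135, -74.0067),
    (40.7580, -73.9855),
]
density = grid_density(points)
for cell, n in density.most_common():
    print(cell, n)
```

Cells measured in degrees are not equal-area, so this is only a rough density proxy, but it avoids overlapping markers hiding how many pickups sit on top of each other.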


Not using Google Fusion Tables :)

https://www.amigocloud.com/api/v1/users/22/projects/3153/dat...

By the way, here is the data in many different formats (kml, shapefile, geojson, etc) in case you want to play with it in other software: https://www.amigocloud.com/data_share/8850bab3c62141278ac5c4...


Why would the TLC have this data? Is this just for cab pickups through the Uber app? If not, would those also be included in this dataset?


It's a lucky thing for a few that the Ashley Madison lat/long is only an approximation based on zip code, yes?



