8

I'll be giving a (short) introductory course on Data Science in a couple of months.

One of my ideas was, instead of lecturing on its theoretical aspects (which of course are important), to have people actually get their hands dirty with data and get a feel for what it's like to work with data on a daily basis.

But for that I think I'd need a single large and interesting (open) data set on which we could work through all the phases of a data science project. Something like:

  1. Define the problem
  2. Get the data
  3. Explore it
  4. Prepare it
  5. Develop our model(s)
  6. Iterate as necessary
  7. Deploy

With the same data set. Or maybe with two related data sets.

So, any ideas of what that data could be?

Patrick Hoefler
ivanmp
  • Are these students coming from a "substantive" discipline (e.g. sociology, medicine, ...), from more "technical" disciplines (e.g. statistics, ...), or from many different disciplines? In the first case I would find an example that fits that discipline. In the latter two cases I would choose examples involving crime, sex, drugs, etc. It is cheap, but it sells... – Maarten Buis Feb 03 '15 at 11:05
  • @MaartenBuis great question! I probably should have mentioned that. They're coming from a Computer Science background. Mostly programmers and database administrators. – ivanmp Feb 03 '15 at 11:07
  • I'd put out calls to other departments ... you might be able to get someone from physics, biology, or other science to partner with you (and give you actual questions they're trying to answer). – Joe Feb 04 '15 at 17:58

3 Answers

5

I would suggest using crime data per city. These datasets typically provide both daily and historical records, including the location and time of day of each incident. When they are combined with other data (weather, census, employment, types of businesses, housing, etc.), many correlations can be analyzed and different models applied.

Below is a link to a blog post that I wrote on this a while back:

http://www.opengeocode.org/articles/crime%20analysis.txt
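
To give a feel for the "combine with other data" step, here is a minimal sketch in plain Python. It assumes you have already reduced both sources to one value per day; the dictionaries `crime_counts` and `max_temp_c`, the helper names, and all the numbers are made up for illustration and do not come from any real dataset.

```python
import math

# Toy, made-up daily values (NOT real data), keyed by ISO date.
crime_counts = {
    "2015-01-01": 12,
    "2015-01-02": 15,
    "2015-01-03": 9,
    "2015-01-04": 20,
    "2015-01-05": 18,  # no weather record for this day
}
max_temp_c = {
    "2015-01-01": 21.0,
    "2015-01-02": 24.5,
    "2015-01-03": 18.0,
    "2015-01-04": 30.0,
    "2015-01-06": 27.0,  # no crime record for this day
}


def join_on_date(left, right):
    """Inner join of two date-keyed dicts: keep only dates present in both."""
    shared = sorted(left.keys() & right.keys())
    return [(day, left[day], right[day]) for day in shared]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


rows = join_on_date(crime_counts, max_temp_c)
r = pearson([crimes for _, crimes, _ in rows], [temp for _, _, temp in rows])
print(f"{len(rows)} matched days, Pearson r = {r:.3f}")
```

With real data the joining step is usually where students spend most of their time (mismatched date formats, missing days, different geographic units), which makes it a good exercise in its own right. A correlation found this way is only a starting point for discussion, not evidence of causation.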

Andrew - OpenGeoCode
4

Your ideal dataset is local, relevant to their discipline (if they have one), and interesting in its own right. In the past I have used Melbourne's "urban forest" dataset of 70,000 trees.

If you have hardcore computer scientists, you'll want a big and challenging dataset, maybe something transport-related, like movements of taxis, public transport, or bike share schemes.

Steve Bennett
  • https://nycopendata.socrata.com/ has loads of good datasets that, while they may not be local to ivanmp, should prove to be big and challenging. – Phil Feb 05 '15 at 11:10
3

It's hard to know your exact audience, but there is a great blog post that lists (by now) more than 100 interesting data sets: "100+ Interesting Data Sets for Statistics".

"Data science" is pretty far-ranging, so you could do many different types of analysis:

philshem