I am currently trying to implement some machine learning algorithms on my own. Many of them have the nasty property of being hard to debug: some bugs don't crash the program but make it behave differently than intended, so the algorithm merely appears to give weaker results.

I would like some way of increasing my confidence in the implementation. For example, small datasets that come with additional information such as "Algorithm X ran for Y iterations and produced result Z on this dataset" would be really helpful. Has anyone heard of such datasets?

sjm.majewski
  • What research have you done in investigating this question? At first blush, one would think that the literature you are using to find these algorithms would be chock full of sample datasets. – whuber Aug 01 '12 at 15:25
  • Well, I know ML mostly from a university course, Coursera, lecture videos on the internet, and a few papers I have read on specific topics. I know there are lots of sample datasets everywhere, but I am looking for some that come with information on how different ML algorithms performed on them, so I can validate my own implementations. – sjm.majewski Aug 01 '12 at 15:50
  • There was a good paper at ICML recently on the problem with standardized datasets - that they stop you from thinking too hard about real-world problems and the messiness those problems involve. Personally, when I started using real-world data my skill as a practitioner blossomed. So while I would not discourage you from using things like UCI as a stepping-stone or a testing ground, keep your eye on the prize! – Patrick Caldon Aug 01 '12 at 22:55
  • You should specify what type of machine learning you are doing. Binary classification data sets are different from function approximation (regression) data sets. – Douglas Zare Aug 02 '12 at 07:12
  • http://stackoverflow.com/questions/3272806/good-source-for-machine-learning-datasets-in-computer-vision/15763420#15763420 – Abhishek Gupta Apr 02 '13 at 12:13
  • This question appears to be off-topic because it is about finding data sets – Peter Flom May 14 '14 at 17:23

2 Answers

From the UC Irvine Machine Learning Repository:

We currently maintain 223 data sets as a service to the machine learning community. You may view all data sets through our searchable interface. Our old web site is still available, for those who prefer the old format. ... If you wish to donate a data set, please consult our donation policy. ... We have also set up a mirror site for the Repository.

Also, the following MIAS dataset has been widely used and studied:

When benchmarking an algorithm it is recommendable to use a standard test database (data set) for researchers to be able to directly compare the results. Most of the mammographic databases are not publicly available. The most easily accessed databases and therefore the most commonly used databases are the Mammographic Image Analysis Society (MIAS) database and the Digital Database for Screening Mammography (DDSM). Besides, there are currently few projects developing new mammographic image databases as well as several old projects.
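
As a complement to published benchmarks, the "known result Z on this dataset" idea from the question can also be sketched on a tiny synthetic problem where the exact answer is computable. The following is only an illustrative sketch, not taken from any of the repositories above: the function names (`make_data`, `closed_form`, `gradient_descent`), the synthetic line `y = 2x + 1`, the noise level, the learning rate, and the iteration count are all assumptions chosen for the example. It compares a hand-written gradient descent for one-variable linear regression against the closed-form least-squares solution.

```python
import random


def make_data(n=50, seed=0):
    """Small synthetic regression set: y = 2x + 1 plus Gaussian noise (assumed)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


def closed_form(xs, ys):
    """Exact least-squares slope and intercept, used as the reference result."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


def gradient_descent(xs, ys, lr=0.1, iters=5000):
    """The implementation under test: batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iters):
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_b = 2.0 / n * sum(residuals)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


if __name__ == "__main__":
    xs, ys = make_data()
    print("reference (closed form):", closed_form(xs, ys))
    print("under test (gradient descent):", gradient_descent(xs, ys))
```

If the two pairs of coefficients disagree beyond a small tolerance, the bug is likely in the iterative code rather than in the data. The same pattern (fixed seed, small dataset, independent reference answer) should carry over to other algorithms wherever a trusted reference is available.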

whuber
deepML

The UCI repository mentioned by Bashar is probably the largest; nevertheless, I wanted to add a couple of smaller collections I came across:

sebp