Questions tagged [dataset]

Requests for datasets are off-topic on this site. Use this tag for questions concerning creating, processing, or maintaining datasets.

Datasets are structured data files in any format, collected together with the documentation that explains their production or use.

1905 questions
29
votes
2 answers

What aspects of the "Iris" data set make it so successful as an example/teaching/test data set

The "Iris" dataset is probably familiar to most people here - it's one of the canonical test data sets and a go-to example dataset for everything from data visualization to machine learning. For example, everyone in this question ended up using it…
Fomite
  • 23,134
16
votes
6 answers

Where to find a large text corpus?

I am looking for a large (>1000) text corpus to download, preferably with world news or some kind of reports. I have only found one with patents. Any suggestions?
16
votes
4 answers

Free public interest data hosting?

I have hourly and daily temperature reports for many stations at http://data.barrycarter.info/ I encourage people to download it, but, at 6.6G, it uses up a lot of bandwidth. Is there a service that hosts "public interest" data for free? I know…
user1566
11
votes
2 answers

Where can I find datasets useful for testing my own Machine Learning implementations?

I am currently trying to implement some Machine Learning algorithms on my own. Many of them have the nasty property of being hard to debug: some bugs don't cause the program to crash, but rather make it work not as intended, and it seems as if the algorithms just…
sjm.majewski
  • 3,648
8
votes
1 answer

Watermarking data for datamining

I'm in a work group that analyzes medical data. Unfortunately there's a lot of concern that measured data might get to a competitor or be manipulated. So I was wondering if there would be a way to "watermark" the measured data before it leaves the house…
bdecaf
  • 425
7
votes
1 answer

Impact of inverting grayscale values on the MNIST dataset

http://yann.lecun.com/exdb/mnist/ Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black). Is there a reason why the original MNIST sets the background to a low value (0) and the foreground to the highest value (255)…
6
votes
5 answers

Free Dataset Resources?

Possible Duplicate: Locating freely available data samples Where can I find freely accessible data sources? I'm thinking of sites like http://www2.census.gov/census_2000/datasets/?
miku
  • 441
6
votes
1 answer

Best practices for documenting a data-science pipeline

Even though I try to keep it as simple as possible, the pipelines for some of my data science projects get rather complex. At some point it becomes necessary to document this pipeline so that someone can return to the project, easily understand the…
captain_ahab
  • 1,512
4
votes
3 answers

Where can I find good publicly available data that I could use to teach z-scores to my college students?

I am sick of using the examples in the book. Is there an easy place to find data for which z-score/percentile/normal distribution stuff would be easy to see?
drury
  • 303
4
votes
3 answers

Finding patterns in data

I am probably looking for a definition. Imagine we have 10 variables, but we are not interested in some kind of linear relation (nor quadratic or with any curve to it). What I would like is a way to find "clusters", patterns or combinations…
4
votes
0 answers

Free database of historical events in database format?

Is there a free database of historical events that's in database format (i.e., CSV or some other easily imported format)? I realize Wikipedia has extensive historical information, but it's not in database format. I also visited historymole.com,…
user1566
4
votes
6 answers

Detecting Numerical Trends

I have a list of numbers that, when plotted on a graph, clearly demonstrate trends such as rising, dropping, repeating, etc. When a human sees the graph, they can easily make out what's happening. What I'm trying to do is achieve the same…
keyboardP
  • 143
3
votes
1 answer

What are some examples of public datasets that have randomized instruments?

Sometimes they ask questions in different orders, or use different prompts. Or datasets with instruments (with at least one variable randomized)? I would like to use at least one of them for my causal modelling course (Stat 566), whose syllabus is…
user4206
3
votes
0 answers

Covid-19 available resources

I don't even know if this is the best place to ask, but since this is a statistics community I think somebody here may know something. I am wondering what kind of databases and data are actually available regarding COVID-19 infection, especially…
cccnrc
  • 217
3
votes
1 answer

Data analysis: describing graphs (vocabulary)

I have to describe graphs, but I'm lacking in vocabulary (English not being my mother tongue). I hope there wasn't a previous open topic; if not, it may be useful for others. What would you call a bump in a curve, where endpoints are similar, as seen…
bixoez
  • 51