What are the available tools for managing crowdsourced data-cleaning tasks?

Question

The Zooniverse team has had amazing success with the OldWeather project, which crowdsources the coding of old ships' logs to extract climate data. I also believe Sunlight did some crowdsourcing (with Mechanical Turk?) to build the Political Party Time database. Are there other services, libraries, etc., that can be used to support crowdsourced volunteers in performing similar types of tasks in open data applications?

I'm imagining tasks like checking a certain percentage of scraped data against the original source, or doing data entry when some documents are non-machine-readable PDFs. At a minimum, I'm assuming a library or service would handle user registration and management, task overlap (giving the same task to several users) for accuracy checking, user scoring, etc.
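To make the overlap and scoring idea concrete, here is a minimal Python sketch that is not tied to any particular platform. It assumes each task is given to several volunteers and the answers are reconciled by majority vote. The function names (`sample_for_checking`, `consensus`, `score_users`) and the 0.6 agreement threshold are illustrative assumptions, not part of any existing library.

```python
import random
from collections import Counter, defaultdict


def sample_for_checking(record_ids, fraction, seed=0):
    """Pick a reproducible random subset of records to verify by hand."""
    ids = list(record_ids)
    k = max(1, round(len(ids) * fraction))
    return random.Random(seed).sample(ids, k)


def consensus(answers, min_agreement=0.6):
    """Return the majority answer, or None if agreement is too low.

    answers: list of answers from different volunteers for one task.
    """
    if not answers:
        return None
    value, count = Counter(answers).most_common(1)[0]
    return value if count / len(answers) >= min_agreement else None


def score_users(submissions, min_agreement=0.6):
    """Score each user by the share of their answers matching consensus.

    submissions: iterable of (task_id, user, answer) tuples.
    """
    by_task = defaultdict(list)
    for task_id, user, answer in submissions:
        by_task[task_id].append((user, answer))

    hits, totals = Counter(), Counter()
    for entries in by_task.values():
        agreed = consensus([a for _, a in entries], min_agreement)
        if agreed is None:
            continue  # no consensus: do not score anyone on this task
        for user, answer in entries:
            totals[user] += 1
            hits[user] += answer == agreed
    return {user: hits[user] / totals[user] for user in totals}
```

Tasks where `consensus` returns `None` are the natural candidates to send out again or to escalate to a trusted reviewer.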

score 4 · Accepted Answer · answered May 09 '13 at 15:35


PyBossa is an open platform for crowdsourcing. You write a bit of HTML/JS that defines the microtask. They have examples, including PDF transcription. Features include user registration, user credits, and statistics.

http://crowdcrafting.org/about


D Read


score 1 · Answer 2 · answered Mar 05 '14 at 12:00

Amazon Mechanical Turk is pretty well set up for these kinds of tasks. Of course, you have to pay workers to complete the tasks, but there are plenty of open-source clients for accessing the service. You might especially want to look at boto for Python and MTurkR for R (note: I am the developer of MTurkR).

cyclondude · Answer 3 · 2014-03-05T00:40:07.943

1

If the data is in tables within PDFs, I've created a script in R that splits the tables into cells and runs OCR on them. You could take the cell images the script creates, use PyBossa or Mechanical Turk to crowdsource the transcription of each cell, and then merge the results back into a table.

https://github.com/hansthompson/pdfHarvester
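The merge step at the end can be sketched in Python as follows. This is not part of pdfHarvester. It assumes each crowdsourced answer comes back as a `(row, col, text)` tuple with zero-based indices, which is a hypothetical format, and it resolves disagreements by taking the most common transcription for each cell. The names `merge_cells` and `to_csv` are illustrative.

```python
import csv
import io
from collections import Counter, defaultdict


def merge_cells(answers):
    """Rebuild a table from per-cell transcriptions.

    answers: iterable of (row, col, text) tuples, with several
    tuples per cell when the same cell went to several workers.
    """
    votes = defaultdict(Counter)
    for row, col, text in answers:
        votes[(row, col)][text.strip()] += 1
    if not votes:
        return []

    n_rows = max(r for r, _ in votes) + 1
    n_cols = max(c for _, c in votes) + 1
    # Cells nobody transcribed stay as empty strings.
    table = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for (row, col), counter in votes.items():
        table[row][col] = counter.most_common(1)[0][0]
    return table


def to_csv(table):
    """Serialize the merged table as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(table)
    return buf.getvalue()
```

A usage example, with made-up transcriptions:

```python
answers = [(0, 0, "Year"), (0, 1, "Count"),
           (1, 0, "1901"), (1, 0, "1901 "), (1, 0, "1907"),
           (1, 1, "12")]
print(to_csv(merge_cells(answers)))
```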

edited Mar 05 '14 at 00:40

answered May 09 '13 at 20:27

cyclondude

