7

I plan to mine the mailing list archives of open source software projects to answer research questions.

How can I request the data?
What is the procedure?

Are any small datasets of mailing list archives available for a test run? If so, where can I find one?

sboysel
Hemaa mathavan

3 Answers

4

Here is a list of papers discussing the use of email in studying FLOSS (free, libre, and open source software) development.

(Disclosure: I run the flosshub site and wrote a few of those papers, including a survey of how FLOSS researchers have used email in the past and which projects they have studied most.)

I personally think the MarkMail service is very handy for general studies of email. Here is a paper where we used MarkMail to look at the use of pastebins over time on FLOSS projects. (Are they adopting pastebins as an innovation or not?) MarkMail was a great source for that kind of "mile wide, inch deep" analysis where we were just counting words.

However, for large-scale text analysis there is usually so much mail that most people skip services like MarkMail or Gmane and look for the original mbox files instead. That way you can load the messages into a database and do your own cleaning.
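As a rough illustration of the mbox-to-database step, here is a minimal Python sketch using only the standard library (`mailbox` and `sqlite3`). The function name, the table name `messages`, and its columns are hypothetical choices for this example, not a schema used by any of the projects mentioned; it also keeps only the first `text/plain` part of each message, which will not suit every study.

```python
import mailbox
import sqlite3


def load_mbox_into_sqlite(mbox_path, db_path):
    """Copy a few headers and the first text/plain body of each message into SQLite.

    Returns the number of messages inserted. The schema is illustrative only.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "message_id TEXT, sender TEXT, sent TEXT, subject TEXT, body TEXT)"
    )
    count = 0
    for msg in mailbox.mbox(mbox_path):
        body = ""
        # walk() visits the message itself and, for multipart mail, every sub-part.
        for part in msg.walk():
            if part.get_content_type() != "text/plain":
                continue
            payload = part.get_payload(decode=True)  # bytes, or None for containers
            if payload is not None:
                charset = part.get_content_charset() or "utf-8"
                try:
                    body = payload.decode(charset, errors="replace")
                except LookupError:
                    # Unknown or misspelled charset label in the archive.
                    body = payload.decode("utf-8", errors="replace")
            break
        conn.execute(
            "INSERT INTO messages VALUES (?, ?, ?, ?, ?)",
            (
                str(msg.get("Message-ID", "")),
                str(msg.get("From", "")),
                str(msg.get("Date", "")),
                str(msg.get("Subject", "")),
                body,
            ),
        )
        count += 1
    conn.commit()
    conn.close()
    return count
```

Storing the raw header strings first and normalizing dates, addresses, and threads in later passes keeps the import step simple and lets you re-run the cleaning without re-parsing the archive.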

Keep in mind that some projects, like the LKML, no longer have the original mbox files (a tragic loss, IMO), so you are stuck scraping your own data from the various web sites that provide archives, such as lkml.org. Other projects, like the Apache ones, DO have email archives where the mbox files are available.

Also keep in mind that if you are studying FLOSS, the messages themselves will be quite messy and very technical, full of source code, quoted replies, and all kinds of garbage. Right now I am working on a data cleaning project with a 20-year set of email, and it has taken a student and me nearly a full year to clean and process this mail properly.
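To give a feel for what one small cleaning pass can look like, here is a hedged Python sketch that drops quoted reply lines, "On ... wrote:" attribution lines, and everything after a signature delimiter. These three heuristics and the function name are assumptions made for illustration; real archives break them constantly (top-posted replies, inline patches whose lines start with `>`, non-English attribution lines), which is part of why cleaning takes so long.

```python
import re

# Matches single-line attributions such as "On Mon, Dec 7, 2015, Alice wrote:".
ATTRIBUTION = re.compile(r"^On .+ wrote:\s*$")


def strip_quotes_and_signature(body):
    """Return the body without quoted lines, attribution lines, or the signature."""
    kept = []
    for line in body.splitlines():
        # "-- " (dash dash space) is the conventional signature delimiter;
        # many clients strip the trailing space, so accept "--" as well.
        if line.rstrip() == "--":
            break
        if line.lstrip().startswith(">"):
            continue
        if ATTRIBUTION.match(line):
            continue
        kept.append(line)
    return "\n".join(kept).strip()
```

For example, a reply that quotes two lines and ends with a signature is reduced to just the text the sender wrote:

```python
reply = (
    "Thanks, that works.\n"
    "\n"
    "On Mon, Dec 7, 2015, Alice wrote:\n"
    "> Did you try the patch?\n"
    "\n"
    "One more thing.\n"
    "-- \n"
    "Bob\n"
)
print(strip_quotes_and_signature(reply))
```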

Good luck!

Megan Squire
  • Does FLOSSmole contain downloadable mailing list archives? – Hemaa mathavan Dec 04 '15 at 03:29
  • We have a few downloadable archives of email from the Tigris forge. However they are not cleaned and I've never used them for anything so I can't vouch for their utility. It was more of a "collect these in case we need them one day" sort of thing. http://flossdata.syr.edu/data/tig/ (Click up in that directory structure to find more fun files, mostly metadata about projects.) – Megan Squire Dec 07 '15 at 18:30
  • Can you guide me on how you clean your email content? What procedure do you follow? I came across a paper at http://research.microsoft.com/en-us/people/hangli/tang-etal-kdd05.pdf. – Hemaa mathavan Dec 09 '15 at 10:40
  • Can you direct me to some articles that concentrate on email data cleaning? – Hemaa mathavan Dec 09 '15 at 10:42
  • There is no one set procedure, since it really depends on what you are looking for. For example, if you just need email headers, that is a completely different procedure than if you need source code from an email thread, which is different again from needing attachments. I wrote a book on "Clean Data" but it was not email specific :) If you need to talk more specifically (and confidentially) about the particular needs of the data you have, please email me! – Megan Squire Dec 09 '15 at 13:40
3

You could scrape some of the archives from Gmane.

Another source would be mail-archive.com.

You could search the web for specific pipermail archives.

There are likely more.

gerrit
1

OSgeo.org hosts mailing lists for its incubation projects and efforts: http://lists.osgeo.org/mailman/listinfo as well as for its general announcements: http://lists.osgeo.org/pipermail/announce/
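If you want to fetch such a pipermail archive in bulk, the sketch below only builds the candidate download URLs. It assumes the list follows the common Mailman 2 pipermail layout of one gzipped text file per month named `YYYY-Month.txt.gz`; that naming is an assumption here, so check the archive index page of the list you are interested in before relying on it. The function name is made up for this example, and no network request is made.

```python
# English month names are spelled out so the result does not depend on the locale.
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def pipermail_month_urls(base_url, year, first_month=1, last_month=12):
    """Build monthly archive URLs, assuming the YYYY-Month.txt.gz naming scheme."""
    base = base_url.rstrip("/")
    return [
        f"{base}/{year}-{MONTHS[m - 1]}.txt.gz"
        for m in range(first_month, last_month + 1)
    ]


for url in pipermail_month_urls(
    "http://lists.osgeo.org/pipermail/announce/", 2015, 11, 12
):
    print(url)
```

Each downloaded file, once decompressed, is a plain-text digest of that month's messages, so a small test run can start with a single month from a single list.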

user33290