
I'm doing a lot of research in the field of Digital Text Forensics (e.g., authorship attribution, authorship verification, or author profiling). In this field, only very few data sets are available for one's own research. One of the few important sources is the evaluation lab on uncovering plagiarism, authorship, and social software misuse (named PAN).

Over the years, I collected texts from many different internet sources (blogs, forums, Amazon, news portals, Project Gutenberg, etc.) on my own, which I aggregated and compiled into various data sets that could be very valuable for the Digital Text Forensics community. As a researcher, I am very interested in sharing data to make experiments reproducible, but I'm now facing a big problem that I don't know how to handle...

Can I share my data sets without worrying about copyright issues, and if so, how?

It should also be highlighted that the data cannot be anonymized; otherwise it would be totally useless for the research community. Hence, I don't know where to start. Of course there are platforms such as FigShare where data sets can be published, but if I did this, I would first not get any credit, and second could run into a lot of trouble for publishing texts which are not mine (even though they are publicly available).

Has someone else faced this problem?

tripleee
  • Is the number of ways in which the data is used somehow limited? You could offer some form of API access which allows people to use the data without giving full access. – sheß Aug 22 '15 at 12:18
  • What exactly do you mean by API access? Restricted access via user/password for registered users? The data itself can only be used in its plaintext form, in order to give everyone the possibility to extract their own desired features from it... – NeuroMorphing Aug 22 '15 at 14:41
  • This sounds a little self contradictory. You can not share the data if you don't have the right to share it. – sheß Aug 23 '15 at 21:08
  • With API access I meant something like a standard API, or simply an online interface where people could type in some query/scripting language commands, upload data, and obtain results without actually looking at the data. This is the only way I see that people could use the data without obtaining it in full. – sheß Aug 23 '15 at 21:11
  • "self contradictory"? As I said, I want to share the data but I don't want to run into copyright trouble. That's why I'm asking if someone has a good idea how to overcome this problem. Regarding the API, I don't believe an online API would be a good solution (due to many lookups). But perhaps I will publish it anonymously on a platform - this will not solve the copyright issue and I won't get any credit, but at least other researchers can access the data in order to reproduce results. I never thought research could be so complicated... – NeuroMorphing Aug 23 '15 at 21:50
  • Sharing your corpus might be allowed under copyright law as "fair use."

    Check the Internet for "fair use". There is a lot of information. Consult an attorney if in doubt.

    Consider the risks of publishing the entire corpus. If the copyright owner of one of the documents complains, can you remove just that one document? If you don't charge for your corpus, it's unlikely anyone can argue they've been financially damaged by your work.

    – pndfam05 Aug 24 '15 at 20:35
  • @pndfam05: Thank you very, very much for the interesting hint! I must confess that I've never heard about it before, and at first glance it looks pretty helpful. I will look into the matter and check whether "fair use" also applies here in Germany. BTW: Of course I don't charge for the corpora. My only intention is to publish the data in order to improve the research field of Digital Text Forensics. – NeuroMorphing Aug 24 '15 at 23:23
  • @Unhandledexception: As it sounds like you might be in academia: if you're affiliated with a university, you might also want to check if there's an IRB (institutional review board), aka ERB (ethical review board), that has to approve 'research', as they might have rules about data release (and you could be fired if you don't go through them). – Joe Aug 26 '15 at 00:07

1 Answer


I am not a trained copyright expert, but the crux of your endeavour lies in how you aggregated your text corpus:

Over the years, I collected texts from many different internet sources (blogs, forums, Amazon, news portals, Project Gutenberg, etc.) on my own, which I aggregated and compiled into various data sets that could be very valuable for the Digital Text Forensics community.

Your problem

If you want to share the resulting data set with your name on it without risking any trouble, every source you use must be published under a license which explicitly allows its reuse. That means, to be sure, you would have to read the terms/license statement of every blog you scrape. Forums, especially older ones, usually don't enforce a permissive license on user contributions. Project Gutenberg usually has an explicit per-book license, and a general license information page. Their content looks fine to reuse. Commercial entities like Amazon probably don't have a license that allows reuse without explicit, written permission.

Practical recommendation

If your process of generating your derived data sets (the ones you want to share) from a set of given inputs is largely automated, you could recreate a reduced version that relies only on sources that were explicitly published under a permissive license.
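A minimal sketch of such a filtering step, assuming each collected text already carries a license label that you recorded while scraping. The `Document` record, the `filter_permissive` helper, the example URLs, and the contents of the allowlist are all hypothetical; which licenses actually permit redistribution is something you have to decide per source.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Hypothetical allowlist: the license labels you consider safe to redistribute.
# This particular selection is an assumption, not legal advice.
PERMISSIVE_LICENSES = frozenset({"CC0-1.0", "CC-BY-4.0", "public-domain"})


@dataclass(frozen=True)
class Document:
    source_url: str
    author: str
    license_id: Optional[str]  # None means no explicit license was found
    text: str


def filter_permissive(docs: Iterable[Document],
                      allowed=PERMISSIVE_LICENSES) -> List[Document]:
    """Keep only documents whose license is explicitly on the allowlist."""
    return [d for d in docs if d.license_id in allowed]


# Made-up sample records for illustration only.
corpus = [
    Document("https://example.org/blog/1", "alice", "CC-BY-4.0", "sample text"),
    Document("https://example.org/forum/2", "bob", None, "sample text"),
    Document("https://example.org/book/3", "carol", "public-domain", "sample text"),
]

reduced = filter_permissive(corpus)
```

Documents without an explicit license are dropped rather than guessed at, which matches the conservative reading above: anything not explicitly allowed stays out of the shared version.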

Then, in the end, you must choose a license for your own publication. If your data set resembles a database, the ODbL might be a good choice. And just like you want to be attributed for your work, you should accompany your dataset with sufficient attribution for the creators of your sources.
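One way to ship that attribution is a small manifest file next to the data set. The following is only a sketch under the assumption that you kept source URL, author, and license for every text; the field names, the `attribution_manifest` helper, and the sample records are made up for illustration.

```python
import csv
import io

# Assumed metadata fields; adapt them to whatever you recorded while collecting.
FIELDS = ["source_url", "author", "license_id"]


def attribution_manifest(docs):
    """Return a CSV string with one attribution row per document."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for doc in docs:
        # Copy only the attribution fields, not the text itself.
        writer.writerow({name: doc[name] for name in FIELDS})
    return buf.getvalue()


# Made-up sample records for illustration only.
docs = [
    {"source_url": "https://example.org/blog/1", "author": "alice",
     "license_id": "CC-BY-4.0", "text": "sample text"},
    {"source_url": "https://example.org/book/3", "author": "carol",
     "license_id": "public-domain", "text": "sample text"},
]

manifest = attribution_manifest(docs)
```

Writing `manifest` to a file such as `ATTRIBUTION.csv` (a hypothetical name) alongside the corpus keeps the credit for each source visible to anyone who downloads it.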

Yes, this is all very cumbersome, but that's how copyright law is right now.

ojdo