I am considering writing open-source software that would help writers and editors in their work.

To do so, it would offer suggestions based on statistical analysis of English book corpora (e.g., the frequency of words or of word sequences, so-called n-grams).

To this end, I would need to statistically analyse English books, which means:

  • getting the text into an electronic system
  • analysing the text
  • incorporating the results of analysis into a database
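
As a rough illustration of the three steps above, a minimal Python sketch might look like the following. Everything in it is an assumption made for illustration: the tokenizer is deliberately naive, the table name `ngrams` and its columns are hypothetical, and the sample sentence stands in for a real book. The upsert syntax assumes SQLite 3.24 or newer.

```python
import re
import sqlite3
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (inner apostrophes kept)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


def store_counts(conn, counts, n):
    """Add counts to a hypothetical `ngrams` table, summing with existing rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ngrams ("
        "ngram TEXT, n INTEGER, freq INTEGER, PRIMARY KEY (ngram, n))"
    )
    conn.executemany(
        "INSERT INTO ngrams (ngram, n, freq) VALUES (?, ?, ?) "
        "ON CONFLICT(ngram, n) DO UPDATE SET freq = freq + excluded.freq",
        [(gram, n, freq) for gram, freq in counts.items()],
    )
    conn.commit()


if __name__ == "__main__":
    # Placeholder input; a real run would read the text of a whole book.
    sample = "The cat sat on the mat, and the cat slept."
    conn = sqlite3.connect(":memory:")
    store_counts(conn, count_ngrams(tokenize(sample), 2), 2)
    for row in conn.execute("SELECT ngram, freq FROM ngrams ORDER BY freq DESC"):
        print(row)
```

Note that only the counts end up in the database; the input text itself is never stored by this sketch.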

Think of results like those available on the lotrproject for the Lord of the Rings books.

Now, of course, getting the text into an electronic system is prohibited by the copyright notices in most books. However, I would argue that I am using the text for research and non-profit purposes, which in my view satisfies the first factor of the four-factor balancing test.

The second factor is a harder nut to crack, as these works are, of course, the very heart of what copyright protects: works of fiction and non-fiction. However, I would argue that I would destroy my copy of the original work and store only statistics about it in my system. I would use the input purely for research purposes: I would not distribute it afterwards, nor read it myself; I would only feed it to the analysis software.

As for factor three: I would, of course, need to use the whole work as input; otherwise the statistical analysis would be incorrect.

Factor four: the effect on the market value is zero or positive. I would not disclose which works the statistics were based on, or at most I would publish a list of the books included. The database itself would not contain a direct reference to any book (e.g., it would state only that in young-adult books the phrase "c'mon mate" occurs with such-and-such a frequency, not that it is used n times in the Harry Potter books). A sub-question is whether I'm allowed to distribute word frequencies and additional statistics about the individual books themselves (again, cf. the lotrproject page mentioned above for examples).
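
To make the genre-level idea concrete, here is a small Python sketch, under stated assumptions, of how per-book counts could be collapsed into per-genre totals so that no book title survives in the published data. The titles `book_a` and `book_b`, the genre label, and all counts are made-up placeholders, and both function names are hypothetical.

```python
from collections import Counter, defaultdict


def aggregate_by_genre(per_book_counts, genre_of):
    """Sum per-book phrase counts into per-genre totals, dropping the titles."""
    totals = defaultdict(Counter)
    for title, counts in per_book_counts.items():
        totals[genre_of[title]].update(counts)  # Counter.update adds counts
    return dict(totals)


def relative_frequency(genre_counts, phrase):
    """Share of all counted phrases in a genre that are the given phrase."""
    total = sum(genre_counts.values())
    return genre_counts[phrase] / total if total else 0.0


if __name__ == "__main__":
    # Made-up placeholder data, not real measurements.
    per_book = {
        "book_a": Counter({"c'mon mate": 3, "good morning": 1}),
        "book_b": Counter({"c'mon mate": 1, "good morning": 3}),
    }
    genre_of = {"book_a": "young-adult", "book_b": "young-adult"}
    by_genre = aggregate_by_genre(per_book, genre_of)
    print(relative_frequency(by_genre["young-adult"], "c'mon mate"))
```

Only the output of `aggregate_by_genre` would be distributed in this design; the per-book counts would stay private or be discarded.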

But, of course, IANAL, which is why I'm asking this question here.

D. Kovács
  • Is it even established that a statistical analysis of a work is a derivative work? – phoog May 06 '18 at 19:02
  • @phoog AFAIK it's not a derivative work; however, in order to create the analysis, I need to store a copy of the original work in an electronic system, which is, as written in the question, against most copyright disclaimers. – D. Kovács May 07 '18 at 04:12
  • As long as you don't distribute the works as part of the software, what copyright issue could there be? If a user of your software decides to copy the full text of the Lord of the Rings onto his computer and use it with your software, he is liable if that copying is an infringement. It is similar to users of audio editing software. If a user decides to record full audio tracks (e.g. from some other copyrighted source) and then republish those tracks, he is liable for that action, not the software. – Brandin May 07 '18 at 09:23
  • @Brandin: the problem is that I have to copy the texts onto my machine (either by downloading them, which in Europe is not illegal, or by scanning+OCR / typing them myself) in order to be able to create those statistics, which I would then distribute. So the questions are: 1) may I copy the texts in order to create such statistics; 2) may I distribute these statistics, or are they a derived work, thus falling under copyright protection? – D. Kovács May 07 '18 at 10:29
  • See also "Distributing machine learning models (e.g., word embeddings) based on non-sharable datasets". In your case the 'dataset' corresponds to the copyrighted work (e.g. LOTR), and the 'model' corresponds to the statistics that you pulled from it. As for your copying of LOTR, you say that it is already permitted. Unless you are asking whether it really is permitted for you to copy it without permission for the purpose you stated. – Brandin May 07 '18 at 10:46
  • OK, so the database itself is not a derivative work. However, your assumption that my copy of the work is cleared is incorrect. That's actually the question: if I use a copyrighted work in order to create such a database, does that fall under fair use, or is it a copyright violation to use it in such a way (without the explicit consent of the copyright holder)? See my earlier comment: "1) may I copy the texts in order to create such statistics" – D. Kovács May 07 '18 at 16:22
  • Your concerns are reasonable and I don't know how corpus linguistic studies get around this in principle. Google does something very like this, and some of it is in the clear because the older works are in the public domain, but it has taken a position that it can do so as fair use with newer works with mixed litigation success. Reading up on its litigation history in this area would get you on the right track. – ohwilleke May 08 '18 at 23:44
  • @ohwilleke Are you referring to the Google Books results which show copied portions of books? That seems quite different than distributing only "statistics" (e.g. the phrase "have a cow" and variants appeared so many times in books published on a given year, or in a particular book). – Brandin May 09 '18 at 09:54
  • @Brandin I think he was referring to n-grams: https://books.google.com/ngrams – D. Kovács May 09 '18 at 09:55
  • @D.Kovács Yes. I know that there has been litigation related to ngrams but don't know how it was resolved. – ohwilleke May 09 '18 at 14:51

0 Answers