License to enforce open source on derivative work from AI training such as GitHub Copilot or OpenAI Codex?

Question

New technologies like GitHub Copilot or OpenAI Codex use public code to train their models to generate code.

Is there a license that enforces open source on the output of models trained on public code? Is a new license in the works? I don't want to prohibit training.

I previously thought that releasing code under a current GPL-compatible license was enough, but I've seen cases of such licenses being violated under the guise of fair use, with models outputting code without any attribution and that output being relicensed in incompatible ways.

The AGPL added a requirement to close the server-side loophole. Is there a license that covers the new "AI scraping/training loophole", or could a new clause be added to account for AI training? (I.e., if you train a model using the given code, your model must also allow others to download the source code/training set/model corresponding to the output.)

The GitHub Terms of Service include granting a licence to GitHub that gives them certain permissions to use your content for the purposes of "the Service". I don't think the primary licence you put on your software can take away permissions you have granted under a licence that is part of your agreement to such ToS. – starball, Sep 05 '22 at 08:58

score 3 · Accepted Answer · answered Aug 11 '21 at 08:55

The issue here is that licenses by necessity derive their power from copyright law, so they can apply only in situations where the output is legally a derivative work of the original.

It is by no means clear that, in legal terms, a model trained on a given set of data is a derivative work of that data. If it's not, anyone is free to ignore the license on any publicly available data set when using it to train a model. It would be possible to restrict this sort of thing by contract, but that's getting into a very different area. The issue of data, models and derivative works will almost certainly evolve over the next few years.

Philip Kendall

I think it's fairly obvious that training a model makes the model itself a derivative work. It incorporates the work, and is thus derivative by default. That's the core principle. Deviations from that principle are the exception, not the rule, and thus the onus is on the party training the model to prove that they have some kind of exception. I'm not aware of any principle that makes training data an exception: it isn't specifically for satire, it isn't specifically for education, and OpenAI et al. haven't even claimed a fair use exception. – Laereom Mar 31 '23 at 10:20

@Laereom OpenAI et al absolutely 100% are claiming a fair use exception - see e.g. this article: "In these cases, defendants have claimed that the use of copyrighted material in this manner constitutes fair use." – Philip Kendall Mar 31 '23 at 10:41

Well, I stand corrected. I hadn't looked a lot into the legal situation they were in at the moment. I also wasn't aware of the "transformative work" fair use, since it isn't in the textbook disclaimers and seems to have been something interpreted into being after the fact by a judge. It seems they have some amount of legal defense based on precedent, although it also seems to hinge on the actual or expected impact on the competition of the potentially-infringed parties and the outputs of the generative transformers. – Laereom Apr 01 '23 at 12:54
