Can I train a language model using Vicuna-generated text?

Question

Vicuna is an LLM: it is LLaMA fine-tuned on ShareGPT chats, so it is trained on OpenAI-generated data. The LLaMA license prohibits commercial use, and the OpenAI terms of use prohibit using OpenAI-generated data to create competing models. The ShareGPT data consists of conversations that users chose to share themselves.

Now I want to use Vicuna to generate text for training my own language model. Does that model also have to be non-commercial?

In other words, the Vicuna weights are "tainted" by Meta's non-commercial license and OpenAI's terms of use. However, if I train my language model from scratch on open-source datasets plus some Vicuna-generated text, will it also be "tainted"? Or will it not, because its training data contains no original OpenAI data and its weights are not fine-tuned from LLaMA?

@PhilipKendall But OpenAI explicitly says in section 2(c): "You may not ... (iii) use output from the Services to develop models that compete with OpenAI." So this clearly states that Vicuna itself must be non-commercial, right? I just want to know if they have the right to say this. Is it a "law", or is it like saying that non-Americans can't look at the moon 'cause there's a flag there? — janekb04, Apr 26 '23 at 09:02
OpenAI can probably assert control over the exact output of their services. If you feed them into your own LLM and use the output of that, who knows? After all, what OpenAI have done is taken large amounts of copyrighted material and fed them into their LLM... — Philip Kendall, Apr 26 '23 at 09:05
@janekb04, No, it does not clearly state that Vicuna must be non-commercial. As I read it, it states that you are not allowed to train an LLM on the output from OpenAI, regardless of whether you offer your LLM for money or for free. If a judge applies the same interpretation, then Vicuna is already in violation of the OpenAI ToU. — Bart van Ingen Schenau, Apr 26 '23 at 15:01


0 Answers