Problem: I am interested in building a Q&A engine on top of my private data. I am only interested in asking questions related to my data.
Options:
- I train a model from scratch on my own data
- I pick a pretrained large language model and fine-tune it on my data
With option 1, I don't expect to train a model with billions of parameters. I understand that training from scratch is expensive and time-consuming, but I would use a much smaller model in that case. For the sake of argument, quoting [1]:
> Falcon-7B and Falcon-40B have been trained on 1.5 trillion and 1 trillion tokens respectively.

Scaling Falcon-40B's ratio of 40B parameters per 1T tokens, if I have 1 billion tokens in my dataset, I would train a model with roughly 40M parameters.
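To make that back-of-the-envelope scaling explicit (the 1-billion-token dataset size is an assumption for illustration; this simply reuses Falcon-40B's parameter-to-token ratio from [1]):

```python
# Back-of-the-envelope: scale Falcon-40B's parameter-to-token ratio
# down to a hypothetical 1-billion-token dataset.
falcon_params = 40e9   # Falcon-40B parameters
falcon_tokens = 1e12   # tokens Falcon-40B was trained on, per [1]
my_tokens = 1e9        # assumed size of my dataset

params_per_token = falcon_params / falcon_tokens  # 0.04 parameters per token
target_params = params_per_token * my_tokens      # 4.0e7, i.e. ~40M parameters
print(f"~{target_params / 1e6:.0f}M parameters")  # ~40M parameters
```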
Is there an objective answer, or a study, as to which option turns out better?
For completeness, option 2 is something I have tried, but I did not get any decrease in my training loss, i.e., the fine-tuning was effectively a no-op. I used a pretrained GPT-2 model with roughly 130M parameters, and my dataset had about 600 training examples (number of tokens ≈ 600 * 1024 ≈ 600K). Any pointers on whether this (fine-tuning making no difference) is to be expected, and why, would also be appreciated.
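For reference, here is a minimal sketch of the kind of fine-tuning setup I mean (assuming the Hugging Face `transformers` Trainer with the `gpt2` checkpoint; the file name and hyperparameters are illustrative and may differ from what I actually ran):

```python
# Minimal sketch: causal-LM fine-tuning of GPT-2 with Hugging Face Transformers.
# Assumes a plain-text file "my_data.txt"; names and hyperparameters are illustrative.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

raw = load_dataset("text", data_files={"train": "my_data.txt"})

def tokenize(batch):
    # Truncate to 1024 tokens, matching GPT-2's context window.
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="gpt2-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    learning_rate=5e-5,  # a poorly chosen rate is one common reason the loss doesn't move
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```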