Problem: I am interested in building a Q&A engine on top of my private data. I am only interested in asking questions related to my data.
Options:
- I train a model from scratch on my own data
- I pick a pretrained large language model and fine-tune it on my data
With option 1, I don't expect to train a model with billions of parameters. I understand that training from scratch is expensive and time-consuming, but I would use a much smaller model in that case. For the sake of argument, quoting [1]:
> Falcon-7B and Falcon-40B have been trained on 1.5 trillion and 1 trillion tokens respectively.

Scaling Falcon-40B's ratio of 40B parameters per 1T tokens, if I have 1 billion tokens in my dataset, I would train a model with roughly 40M parameters.
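To make that back-of-the-envelope scaling explicit (the 1-billion-token dataset size is an assumption for illustration; this simply reuses Falcon-40B's parameter-to-token ratio from [1]):

```python
# Back-of-the-envelope: scale Falcon-40B's parameter-to-token ratio
# down to a hypothetical 1-billion-token dataset.
falcon_params = 40e9   # Falcon-40B parameters
falcon_tokens = 1e12   # tokens Falcon-40B was trained on, per [1]
my_tokens = 1e9        # assumed size of my dataset

params_per_token = falcon_params / falcon_tokens  # 0.04 parameters per token
target_params = params_per_token * my_tokens      # 4.0e7, i.e. ~40M parameters
print(f"~{target_params / 1e6:.0f}M parameters")  # ~40M parameters
```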
Is there an objective answer, or a study, as to which option turns out better?
For completeness, option 2 is something I have tried, but I did not get any decrease in my training loss, i.e., the fine-tuning was effectively a no-op. I used a pretrained GPT-2 model with roughly 130M parameters, and my dataset had about 600 training examples (number of tokens ≈ 600 * 1024 ≈ 600K). Any pointers on whether this (fine-tuning making no difference) is to be expected, and why, would also be appreciated.
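For reference, here is a minimal sketch of the kind of fine-tuning setup I mean (assuming the Hugging Face `transformers` Trainer with the `gpt2` checkpoint; the file name and hyperparameters are illustrative and may differ from what I actually ran):

```python
# Minimal sketch: causal-LM fine-tuning of GPT-2 with Hugging Face Transformers.
# Assumes a plain-text file "my_data.txt"; names and hyperparameters are illustrative.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

raw = load_dataset("text", data_files={"train": "my_data.txt"})

def tokenize(batch):
    # Truncate to 1024 tokens, matching GPT-2's context window.
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="gpt2-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    learning_rate=5e-5,  # a poorly chosen rate is one common reason the loss doesn't move
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```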