6

I'd like to train a system that takes text and predicts IAB classes.

Are there any public datasets available for this?

fgregg
Jim K.
  • I am planning to create a dataset for this and make it public - can you let me know your requirements if you still have them 4 years on :) – dendog Mar 05 '21 at 21:03
  • Awesome! I would still welcome a dataset for this. Ideally it would be a multi-label dataset and (in a perfect world) would cover paragraph-sized text or smaller, since aggregating up is easier than disentangling down. As the chunks get larger, the odds of needing multiple labels seem to increase too. Good luck, and let me know if there's a repo or something to track. – Jim K. Mar 09 '21 at 11:26
  • @dendog Is your dataset available? Thanks a lot! – Nicolas Raoul Mar 14 '24 at 04:13
  • Sorry I never got round to this, plus I feel now with LLMs and synth data - this is sort of solved. – dendog Mar 24 '24 at 17:03
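
To illustrate the "aggregating up" point from the comments: if labels exist at paragraph level, document-level labels can be derived by combining them. This is only a minimal sketch; the function name, the `min_count` threshold, and the label strings are illustrative assumptions, not part of any existing dataset.

```python
from collections import Counter


def aggregate_labels(paragraph_labels: list[set[str]], min_count: int = 1) -> set[str]:
    """Combine per-paragraph label sets into one document-level label set.

    A label is kept if it appears in at least `min_count` paragraphs.
    With min_count=1 this is a plain union of all paragraph labels.
    """
    counts = Counter(label for labels in paragraph_labels for label in labels)
    return {label for label, n in counts.items() if n >= min_count}


# Placeholder labels for three paragraphs of one document.
paragraphs = [{"Sports"}, {"Sports", "Travel"}, {"Sports"}]

print(sorted(aggregate_labels(paragraphs)))               # ['Sports', 'Travel']
print(sorted(aggregate_labels(paragraphs, min_count=2)))  # ['Sports']
```

Going the other way (splitting a document-level label back onto individual paragraphs) has no equally simple rule, which is why smaller labeled units are the more flexible starting point.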

2 Answers

2

I was looking for that as well and have not found any so far. However, this paper describes an approach that uses Wikipedia data for this.

Andrey
1

I think this is what you are looking for:

https://www.kaggle.com/datasets/bpmtips/websiteiabcategorization

If you want to validate your model, you can use this free API to check its results:

https://front-page.com/domain.php?domain=google.com

You just need to provide attribution to https://front-page.com
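
For checking many domains, the lookup URL above can be built programmatically. This is a rough sketch that assumes the endpoint accepts only the `domain` query parameter shown in the example; `build_lookup_url` is a hypothetical helper, and the response format is not described here, so the code stops at constructing the URL.

```python
from urllib.parse import urlencode

# Endpoint copied from the example URL above. Assumption: "domain" is the
# only query parameter that is needed.
BASE_URL = "https://front-page.com/domain.php"


def build_lookup_url(domain: str) -> str:
    """Return the lookup URL for a single domain (hypothetical helper)."""
    return f"{BASE_URL}?{urlencode({'domain': domain})}"


print(build_lookup_url("google.com"))
# https://front-page.com/domain.php?domain=google.com
```

You would then fetch each URL (for example with `urllib.request.urlopen`) and inspect what comes back before comparing it with your model's predictions.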

Bpm Tips
  • Also take a look at https://commonscreens.com. They have already-categorized data for 77 million domains and also provide a training dataset. – Bpm Tips Jan 18 '23 at 17:18