
I'm creating a web service for learning words, and I don't want to make a user go through every English word. For example, if you know the word "table", you most likely also know "apple", "door", "hi" and so on. Why? Because they are common and appear frequently. The same works for singular and plural: if you know "car", you know "cars" as well, but for "mouse" it's not so obvious that you know "mice". There are many such cases, such as special terms, so I just need a database with the following structure:

word1 word2 prediction

prediction is a value from 0 to 1 showing how likely it is that you know word2 if you know word1.

Is there any free database or API for this? For now I'm only looking for English, but it would be great if something similar exists for other languages.

devalone
  • I don't know if this is on-topic or not because it seems to be about the development of software that assists language learning rather than the software itself – Anthony Pham Apr 04 '18 at 22:23
  • @AnthonyPham Yes, it's about resources required for a specific type of software, but I don't see how the required resource(s) would be used for a different purpose, so I would consider the question on topic. – Tsundoku Apr 05 '18 at 15:02
  • I think it is unlikely that you will find an API or database with this information, let alone a free one. In principle, one could analyze a number of written texts, from the basic reading level to highly technical scientific papers, and calculate "correlation coefficients" between words. In the simplest form, this could be a symmetric relationship, so prediction(word1,word2) = prediction(word2,word1). This kind of text analysis is big nowadays, but you're asking for a fairly specific algorithm and the only people with accurate answers to these questions are big tech companies. – Myridium Apr 06 '18 at 04:16
  • @ChristopheStrobbe - this resource could be used for marketing purposes, to appeal to cliques by using their lingo. This question is asking for big data, and in addition, how to process that data. Nothing specific to the actual learning process of English at all, except for an incidental use case. I believe it is off-topic, and belongs on the data science stack exchange. – Myridium Apr 07 '18 at 01:41
  • I found that Google keeps public datasets. The data you want might be there; I'm unsure. – Myridium Apr 07 '18 at 01:41
  • @Myridium The fact that the same resource can also be used for purposes unrelated to language learning does not make the question off topic. For example, spaced repetition can also be used for other purposes than language learning, but that does not make spaced repetition off topic. – Tsundoku Apr 07 '18 at 17:17
  • Sounds like a fascinating idea. It probably wouldn't be (conceptually) hard to build such a database. The hard part would be getting people to volunteer to populate it :) – Flimzy Apr 11 '18 at 17:27
  • words that are found together in a context: collocation. – Lambie Jun 20 '21 at 15:51
  • Check out "word2vec". It's what the Semantle word game uses to measure similarity of words, and I'm pretty sure there are free versions out there. – JonathanZ Nov 12 '22 at 22:09
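
To make the text-analysis idea from the comments a bit more concrete, here is a minimal sketch of the simplest symmetric variant: record for each word whether it occurs in each text, then correlate those occurrence patterns. Everything in it is an assumption for illustration: the five one-line "texts" are made up, the names `presence_vectors`, `correlation` and `prediction` are hypothetical, and clipping negative correlations to 0 is just one way to fit the asker's 0 to 1 range. A usable table would need a large graded corpus, not a toy like this.

```python
from math import sqrt

# Hypothetical toy corpus: each string stands in for one whole text.
TEXTS = [
    "the cat sat on the table near the door",
    "an apple is on the table",
    "open the door and eat an apple",
    "the enzyme binds the substrate",
    "the substrate and the enzyme react",
]


def presence_vectors(texts):
    """Map each word to a 0/1 list saying whether it occurs in text i."""
    token_sets = [set(t.lower().split()) for t in texts]
    vocab = sorted(set().union(*token_sets))
    return {w: [1 if w in s else 0 for s in token_sets] for w in vocab}


def correlation(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def prediction(vectors, word1, word2):
    """Symmetric score in [0, 1]; negative correlations are clipped to 0."""
    return max(0.0, correlation(vectors[word1], vectors[word2]))


vectors = presence_vectors(TEXTS)
# Print rows in the "word1 word2 prediction" layout from the question.
for w1, w2 in [("enzyme", "substrate"), ("table", "apple"), ("table", "enzyme")]:
    print(w1, w2, round(prediction(vectors, w1, w2), 2))
```

Because the score is symmetric, it cannot express "knowing word1 implies knowing word2, but not the other way round"; an asymmetric table would need something like conditional probabilities estimated from real learner data.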

1 Answer


I think you can do that by using what is called "word similarity" in NLP. The higher the word similarity score, the more likely it is that a learner who knows one word also knows the other.

You'd need some programming to do this. I think there are 2 good choices for this:

  1. Using the lib gensim as suggested by this answer.
  2. Using spaCy, following this doc.
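
As far as I know, both libraries score word similarity by taking the cosine similarity of word vectors. Here is a minimal sketch of just that step, written without either library so it runs as-is. The 3-dimensional vectors in `TOY_VECTORS` are made up (real ones come from a trained model and have hundreds of dimensions), the helper names `cosine`, `prediction` and `build_rows` are hypothetical, and clamping the score to the 0 to 1 range is my assumption for matching the structure asked for in the question.

```python
from math import sqrt

# Made-up vectors for illustration only; a real model would supply these.
TOY_VECTORS = {
    "car": [0.9, 0.1, 0.0],
    "cars": [0.8, 0.2, 0.1],
    "table": [0.1, 0.9, 0.2],
    "mice": [0.0, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def prediction(word1, word2, vectors=TOY_VECTORS):
    """Similarity clamped to [0, 1] (assumption: negatives count as 0)."""
    return min(1.0, max(0.0, cosine(vectors[word1], vectors[word2])))


def build_rows(words, vectors=TOY_VECTORS):
    """Return (word1, word2, prediction) rows for every ordered pair."""
    rows = []
    for w1 in words:
        for w2 in words:
            if w1 != w2:
                rows.append((w1, w2, round(prediction(w1, w2, vectors), 3)))
    return rows


for row in build_rows(["car", "cars", "table"]):
    print(*row)
```

With a real model you would drop `TOY_VECTORS` and ask the library for the score directly, for example via gensim's `KeyedVectors.similarity` or the `similarity` method on spaCy tokens. Keep in mind that this measures how related two words are in usage, which is only a proxy for "if you know one, you know the other".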
peterhung
  • In language learning, words would be said to collocate: bed, chair, table, bookshelf or fridge, stove and kitchen table. None of those are similar; they are all different, but they refer to a "home universe", for example. – Lambie Jul 16 '22 at 20:23
  • @Lambie This is what Word2Vec is for. Each word is associated with a vector. Words that are related will have close vectors, even if they sound different. Maybe in the context of language learning, you should train it differently than for common NLP use cases, because mouse and mice are very close and we want them "not so related". – Thomas Mar 06 '24 at 12:19
  • In linguistics, words that are likely to be in the same context are said to collocate. And if you are "learning words", you would want mouse and mice to be very related. That program may say vector, but the technical word is collocation. – Lambie Mar 06 '24 at 15:48