My Motivations: I'm trying to learn German and realized there's a confounding feature in the structure of German: every noun has a grammatical gender that, in many cases, seems unrelated to the noun itself.

Unlike in languages such as English, each noun takes a definite article that depends on its gender: der (masculine), die (feminine), or das (neuter). For example: das Mädchen ("the girl"), der Rock ("the skirt"), die Hose ("the trousers/pants"). So there seems to be no correlation between the gender assigned to a noun and its meaning.

The Data: I gathered up to 5000 German nouns, with 3 columns (das, der, die) holding 1s and 0s for each word. So my data is already clustered, via one-hot encoding of the gender, and I'm not trying to predict anything.
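To make the layout concrete, here is a minimal sketch of turning such one-hot rows into a single article label per word. The three sample rows, the column order (das, der, die), and the helper name `to_label` are assumptions for illustration, not the actual dataset.

```python
# Hypothetical rows in the layout described above: (word, das, der, die).
rows = [
    ("Mädchen", 1, 0, 0),
    ("Rock", 0, 1, 0),
    ("Hose", 0, 0, 1),
]

# Assumed column order of the three one-hot flags.
ARTICLES = ("das", "der", "die")


def to_label(flags):
    """Return the article whose flag is 1; reject rows that are not one-hot."""
    if any(f not in (0, 1) for f in flags) or sum(flags) != 1:
        raise ValueError(f"not a one-hot row: {flags}")
    return ARTICLES[flags.index(1)]


# One (word, article) pair per noun.
labelled = [(word, to_label(flags)) for word, *flags in rows]
print(labelled)
```

The check for exactly one flag is deliberate: any noun recorded with two genders would raise an error here and would need separate handling.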

Why I'm here: I am clueless about where to start and how to approach this problem, as the concept of distance in clustering doesn't make sense to me in this setting. I can't think of a way to generate an understandable description of these clusters. The mixed data makes it impossible for me to think of hard-coded metrics for evaluation.

So, my question is: I want to find some patterns, some characteristics of these words that made them fall into a specific cluster. I don't know if I'm making any sense, but some people have already managed to find some patterns (for example, word endings, or that elongated objects tend to be masculine, etc.), and I believe ML/AI could do a far better job at this. Would it be possible for me to do something like this?
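As a starting point for the word-ending patterns mentioned above, a minimal sketch could count how often each suffix co-occurs with each article. The nine sample nouns, the suffix length of 3, and the function names are assumptions for illustration; the real input would be the full labelled list.

```python
from collections import Counter, defaultdict

# Hypothetical sample of (noun, article) pairs.
nouns = [
    ("Mädchen", "das"), ("Häuschen", "das"), ("Zeitung", "die"),
    ("Wohnung", "die"), ("Freiheit", "die"), ("Computer", "der"),
    ("Lehrer", "der"), ("Hose", "die"), ("Rock", "der"),
]


def suffix_table(pairs, length):
    """Map each suffix of the given length to a Counter of articles."""
    table = defaultdict(Counter)
    for word, article in pairs:
        if len(word) > length:  # skip words that are no longer than the suffix
            table[word[-length:].lower()][article] += 1
    return table


def consistent_suffixes(table, min_count=2):
    """For suffixes seen at least min_count times, return the majority
    article and the share of words that carry it."""
    result = {}
    for suffix, counts in table.items():
        total = sum(counts.values())
        if total >= min_count:
            article, hits = counts.most_common(1)[0]
            result[suffix] = (article, hits / total)
    return result


table = suffix_table(nouns, 3)
print(consistent_suffixes(table))
```

On real data, one would probably repeat this for several suffix lengths and raise `min_count`, since a suffix that appears only a handful of times says little.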

Some personal thoughts: While I was doing some (perhaps naive) research, I realized the potential options are decision trees and the COBWEB algorithm. Also, I was wondering whether I could scrape a few images (say 5) for every word, run some image classification, and inspect the intermediate layers of the NN to see if any specific shapes go with a specific gender. In addition to that, I was wondering whether scraping the Google Ngram Viewer data for these words could help in any way. I couldn't think of a way to use NLP or its subdomains.
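On the decision-tree idea: a tree picks its splits by how much a feature reduces uncertainty about the label, commonly measured as information gain. Below is a minimal sketch that scores a few "ends with this suffix" features that way, without building a full tree. The sample nouns and the candidate suffixes are assumptions for illustration.

```python
from collections import Counter
from math import log2

# Hypothetical sample of (noun, article) pairs.
nouns = [
    ("Mädchen", "das"), ("Häuschen", "das"), ("Zeitung", "die"),
    ("Wohnung", "die"), ("Freiheit", "die"), ("Computer", "der"),
    ("Lehrer", "der"), ("Hose", "die"), ("Rock", "der"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(pairs, suffix):
    """Entropy reduction from splitting on 'word ends with suffix'."""
    labels = [article for _, article in pairs]
    yes = [a for w, a in pairs if w.lower().endswith(suffix)]
    no = [a for w, a in pairs if not w.lower().endswith(suffix)]
    if not yes or not no:  # the split separates nothing
        return 0.0
    weighted = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(labels)
    return entropy(labels) - weighted


# Assumed candidate endings, ranked from most to least informative.
candidates = ["chen", "ung", "er", "e"]
ranked = sorted(candidates, key=lambda s: information_gain(nouns, s), reverse=True)
print(ranked)
```

A library decision tree would apply the same kind of criterion recursively; scoring features one by one like this is only a first look at which endings carry signal.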

Alternatives: If everything I just wrote sounds nonsensical, please suggest a way to make visual representations of my dataframe in Python (something like nodes and paths with images at the nodes, one for each cluster), so that I can make pictorial mind maps and try to memorize them.

The ultimate purpose is to make learning German simpler for myself and possibly for others.

Edit 1: This idea of who sets a gender makes me think that if I approached the problem from this perspective, I might have a better chance of succeeding with my problem statement. Basically, by looking at genders across multiple languages and over longer timeframes.

yash
  • It would help to know something about grammatical gender from a language-learning perspective, but that's out of scope for this Stack. Native speakers would have a hard time explaining any rule, and I don't know how we suck it up with the mother's milk, just that we do. Natives would advise that it simply has to be learned as pairs, but that's not entirely true if -chen, for example, is always neuter (I see I'm not telling you anything new, you did some research). "The ultimate purpose is to make learning German simpler": I'm not sure that matters to the question; at worst it looks ridic. xD – vectory Jun 08 '20 at 11:31
  • Have you read this? https://linguistics.stackexchange.com/questions/35863/word2vec-why-does-the-famous-equation-king-woman-man-%E2%89%83queen-hold?r=SearchResults Conceivably, a neural net could be trained to recognize morphemes that correlate positively with gender. Ideally you get different dimensions from the same ending (which may have different genders, e.g. n. or fem.) depending on what's in between, without even supplying segmentation. In practice, I wouldn't know where to start either, haha lol wut. Maybe it would only show that it is indeed mostly arbitrary; I really dunno – vectory Jun 08 '20 at 11:40
  • By now you will have noticed that not only does every German noun have a gender, it also has a plural ending, which is different from one noun to another. Unlike English, there is quite a choice of plural endings in German. In fact, these correlate (a bit) with gender, which is good from your point of view. Unfortunately, while native speakers can assign gender consistently to nonsense words by their shape, this is not possible for learners, and the plural and gender of German nouns pretty much has to be learned along with the spelling and meaning. The pronunciation is rule-governed, at least. – jlawler Jun 08 '20 at 15:22
  • If the correlated qualities can be extracted from/retrieved with the noun (e.g. by network action, or like via local process or api call etc), then something can be made, but hard to say whether it'd be meaningful; like said above, you might simply find that it's too unpredictable. Manifold models might help you find abstract correlates; I'd probably start there. Unsupervised ai is limited only by your imagination (and computational ability), but there are often anthropocentric leaps in people's expectations (e.g. correlate 'feminine' has prior dependence on ability to distinguish femininity). – TheLoneDeranger Jun 08 '20 at 19:05
  • You could try to model the distribution by a conditional GAN for words conditioned on gender: one network generating fake words of a gender, the other trying to distinguish fake from real ones. – phipsgabler Jun 10 '20 at 10:15
  • The above is just a random idea, but the reason I came to it was that while native speakers themselves don't know how gender is assigned, you can learn something from the way they gender loanwords: sometimes based on apparent morphology (die Sauna, der Computer), sometimes based on original gender (das Virus) or inferred patterns (der Virus, even if wrong), sometimes based on semantic similarity (der Star), etc. (See here.) – phipsgabler Jun 10 '20 at 10:18
  • @vectory hey, thank you for that reference, I did gain some amount of context, but like it was mentioned in those answers, word vectors don't really care about the innate meaning of the word but what matters mostly is the play between the two words or how the occurrence of those two words happens to be considering a bunch of sentences. I think word vectors might not be a good idea for my doubt because, perhaps, we would be chasing a horizon that doesn't exist as the language already follows a particular set of rules and we'd be reconstructing it in a vector form from a bunch of sentences. – yash Jun 10 '20 at 20:31
  • @jlawler what do you mean by "assign gender consistently to nonsense words by their shape"? Is it impossible for me to learn the shape of words? I plan to spend a couple of years at least learning German. – yash Jun 10 '20 at 20:33
  • One fact showing that grammatical gender can be conceptually detached from the word form is those cases where loan words assume the gender of a synonym, or of what's logical; however, we also see that work its way back to agree with morphology where applicable (e.g. das > der Comput-er, later das Laptop > der Laptop(-Computer), but das Notebook (~das Buch)). In cases where the morphology plays no role, you can only hope that there's no incidental correlation, so that it is statistically insignificant. Anyway, a comment in the linked thread pointed out that it works on morphemes, too. – vectory Jun 10 '20 at 20:34
  • I think what JLawler means is that gender is similar to prepositions (in English and elsewhere; case in point, why "in English"). The bulk of the work on stuff like this has been done, I guess, so automatic exploration might be rather useful to discover the cases without any apparent pattern, to then reason logically about these, or to find more examples. The basic patterns like "-ung" ~ feminine, by contrast, are found in any good dictionary (canoo.net, e.g.) – vectory Jun 10 '20 at 20:43
  • or for an overview, try wiktionary https://en.wiktionary.org/wiki/Category:German_suffixes (no guarantees of correctness). Also, there's a datascience.SE, but they might have high standards for question quality, idk. – vectory Jun 10 '20 at 20:47
  • @TheLoneDeranger I really can't figure out what it is that you are trying to say. Please do elaborate if you can. – yash Jun 10 '20 at 20:55
  • @phipsgabler thank you for the reference. I must admit that loan words (where they came from, and at what point in history) do have a lot of influence on gender. – yash Jun 10 '20 at 20:57
  • @vectory After going through all these comments, I kind of get the feeling that there was a lot of hypocrisy involved as the genders evolved in the German language, depending on the time and place a new word was adopted into the language :/ – yash Jun 10 '20 at 21:10
  • hypocorism, maybe – vectory Jun 10 '20 at 21:15
  • @Freak: do a Google search on Zubin Koepke gender. Zubin and Köpke (he spells his name both ways) have done a lot of research on German gender recognition by native speakers. They did a number of experiments; the results were always that German native speakers agreed with one another, independently, on what gender a new word would be assigned, based more on the sound of the word than what it meant. – jlawler Jun 10 '20 at 22:50

0 Answers