26

I came upon an excellent graphical representation of the linguistic distance between a number of European languages. I'm looking for a similar worldwide map of currently spoken languages. Or at least the raw data. Even if it's not truly worldwide, but just wider than the one presented here, I'd still be interested in it.

[Image: map of lexical distance among the languages of Europe]

Fiksdal
  • 4
    You use the term 'linguistic difference' but the legend on the graphic says 'lexical distance'. These would seem to imply very different things, why the difference in terms? – Gaston Ümlaut Apr 22 '16 at 23:24
  • 3
    Ok, judging by comments at Etymologikon and LanguageHat it's clear this map does not represent 'linguistic distance' but 'lexical distance'. It's also not clear what list/s it's based on, perhaps Swadesh? It also seems not to distinguish borrowings from inheritances so it simply shows present-day lexical similarity. – Gaston Ümlaut Apr 22 '16 at 23:41
  • @GastonÜmlaut Both concepts are new to me. Could you explain the difference between the two? – Fiksdal Apr 23 '16 at 02:24
  • I don't know what 'linguistic distance' would be, but from the provided answer it seems it's being assumed to be some measure of overall distance between two languages in terms of their lexical inventory, their morphology, syntax, etc. Anyway, the graphic doesn't use the term 'linguistic distance'. – Gaston Ümlaut Apr 23 '16 at 23:58
  • 2
    While 'lexical distance' is also a term I hadn't heard before, it's fairly clear what it means. The source of that map indicates it's comparing lists of words from two (or more) languages and counting how many are similar forms with similar meanings. If so, it includes some cognates (i.e. those whose form and meaning are still similar), borrowings, false cognates (i.e. coincidences), etc., so it seems a pretty useless measure, although it might have relevance for language learning. If someone could read Tyshchenko's article (in Russian) they might be able to clear this up. – Gaston Ümlaut Apr 24 '16 at 00:13
  • 1
    Is there any chance of updating these with Cornish and Manx on the map? I understand that they may not have many speakers but if you're putting Asturian on there, at least pay heed to the two living Celtic languages that have been missed out; just a couple of tiny dots would be enough... – Dewy Jun 12 '21 at 13:31

4 Answers

24

@Fiksdal, I am the author of this version,

https://alternativetransport.files.wordpress.com/2015/05/lexical-distance-among-the-languages-of-europe-2-1-mid-size.png

which is based on Tyschenko's work, see here

Since translating Tyschenko's map, I have spent a lot of time trying to figure out how the original list was made. Not much of Tyschenko's work is online, and the best content is in Ukrainian or Russian. I also tried reaching out to the university in Kiev and its linguistics department, with no success. And I have limited access to any paper copies of his research.

But I think I have a pretty good idea of how the data was obtained. These maps basically show Levenshtein distances, or a similar measure of lexical distance, for a list of common words. This list could be the Swadesh № 100 or № 207 list, with duplicate letter shifts in different words counted as one LD, or it could be the Dolgopolsky № 15 list or the Swadesh–Yakhontov № 35 list, with Levenshtein distances simply counted by brute force on those lists. Or Tyschenko could have had his own list of words and his own methods for calculating the lexical distance. One paper describes a master matrix containing all of the lexical distance calculations and says that each was calculated, but it does not state which exact method or which word list was used.
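A minimal sketch of the brute-force variant described above, assuming plain Levenshtein distance on aligned word lists. The normalisation by the longer word and the averaging into a percentage are assumptions made for illustration, not Tyschenko's confirmed method, and the four-concept word list is a toy example rather than an actual Swadesh list.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance: each insertion, deletion or substitution costs 1."""
    prev = list(range(len(b) + 1))          # distances from "" to every prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute, or free match
            ))
        prev = curr
    return prev[-1]


def list_distance(words_a, words_b):
    """Mean normalised distance over two aligned lists, as a rounded percentage.

    Assumes both lists have the same length and contain no empty strings.
    """
    total = 0.0
    for wa, wb in zip(words_a, words_b):
        total += levenshtein(wa, wb) / max(len(wa), len(wb))
    return round(100 * total / len(words_a))


if __name__ == "__main__":
    # Toy list for the concepts water, hand, night, milk (illustrative only).
    english = ["water", "hand", "night", "milk"]
    dutch = ["water", "hand", "nacht", "melk"]
    print(list_distance(english, dutch))
```

Swapping in a longer list only changes the input; the choice of normalisation will shift the absolute numbers, so scores from different schemes are not directly comparable.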

I experimented with a couple of lists and methods myself, and I come fairly close to the original matrix. I am working on a program that does this automatically (given the same word list in different languages) and produces a matrix over several languages. Here is a test on the Germanic languages:

|En|Sc|Du|Af|Ls|Li|Wf|Sf|Nf|Lu|Ge|Yi|Da|Sw|Fa|Ic|Nb|Nn|Sr
|00|13|29|30|30|28|26|26|32|40|33|34|39|37|38|42|38|36|37 English
|13|00|33|31|34|31|26|30|33|40|36|39|42|37|42|44|41|42|36 Scots
|29|33|00|07|15|15|28|26|25|35|29|33|38|33|37|37|39|37|41 Dutch
|30|31|07|00|17|19|30|25|23|34|31|34|34|31|36|37|35|33|40 Afrikaans
|30|34|15|17|00|22|30|24|21|36|24|30|38|34|37|37|37|32|40 Low Saxon
|28|31|15|19|22|00|28|31|30|33|28|34|37|34|40|39|38|38|44 Limburgs
|26|26|28|30|30|28|00|26|27|39|33|38|42|37|39|43|42|41|40 West Frisian
|26|30|26|25|24|31|26|00|29|41|32|32|40|37|43|45|39|36|43 Saterland Frisian
|32|33|25|23|21|30|27|29|00|41|36|37|37|36|38|37|37|34|39 North Frisian
|40|40|35|34|36|33|39|41|41|00|27|40|45|42|50|51|45|45|51 Luxembourgish
|33|36|29|31|24|28|33|32|36|27|00|30|39|34|42|40|38|37|45 German
|34|39|33|34|30|34|38|32|37|40|30|00|38|34|36|36|36|36|42 Yiddish
|39|42|38|34|38|37|42|40|37|45|39|38|00|19|24|27|04|18|42 Danish
|37|37|33|31|34|34|37|37|36|42|34|34|19|00|28|28|20|19|39 Swedish
|38|42|37|36|37|40|39|43|38|50|42|36|24|28|00|10|21|20|44 Faroese
|42|44|37|37|37|39|43|45|37|51|40|36|27|28|10|00|25|20|43 Icelandic
|38|41|39|35|37|38|42|39|37|45|38|36|04|20|21|25|00|15|41 Norwegian (bokmål)
|36|42|37|33|32|38|41|36|34|45|37|36|18|19|20|20|15|00|38 Norwegian (nynorsk)
|37|36|41|40|40|44|40|43|39|51|45|42|42|39|44|43|41|38|00 Sranan

For comparison, from Swedish I get 19 (Tyschenko got 21) to Danish, 28 (26) to Icelandic, and 19 (16) to Norwegian (bokmål).

From German I get 33 (49) to English, 29 (25) to Dutch, and 30 (41) to Danish.

My method and list do not match Tyschenko's, but at times I am pretty close. So yes, a worldwide version of this is possible. It needs the raw data from the languages (word lists), and I have to define a method and write a program that can analyse the lists (and then go through the results to check them manually).
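The kind of program described here, one that takes the same word list in several languages and returns a pairwise matrix, could look roughly like the sketch below. It is not the code behind the table above: the scoring (normalised Levenshtein, averaged and scaled to 0–100) is an assumption, and `WORD_LISTS` is a hypothetical three-language, four-concept sample.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def distance_matrix(word_lists):
    """Return {(lang_x, lang_y): score} for every ordered pair of languages.

    word_lists maps a language code to a list of words; position k in every
    list must express the same concept.
    """
    langs = list(word_lists)
    matrix = {(lang, lang): 0 for lang in langs}
    for x, y in combinations(langs, 2):
        pairs = list(zip(word_lists[x], word_lists[y]))
        score = sum(edit_distance(a, b) / max(len(a), len(b)) for a, b in pairs)
        matrix[x, y] = matrix[y, x] = round(100 * score / len(pairs))
    return matrix


# Hypothetical sample: concepts water, hand, night, milk.
WORD_LISTS = {
    "En": ["water", "hand", "night", "milk"],
    "Du": ["water", "hand", "nacht", "melk"],
    "Ge": ["wasser", "hand", "nacht", "milch"],
}

if __name__ == "__main__":
    matrix = distance_matrix(WORD_LISTS)
    langs = list(WORD_LISTS)
    # Print in the same pipe-separated layout as the table above.
    print("|" + "|".join(langs))
    for row in langs:
        cells = "|".join(f"{matrix[row, col]:02d}" for col in langs)
        print(f"|{cells} {row}")
```

Because the result is keyed by language pair, extending it to more languages only means adding entries to `WORD_LISTS`; the manual check of the results mentioned above would still be needed.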

curiousdannii
  • 1
    Thanks, I will add the links I wanted to. I will post progress updates on the blog. – Alternative Transport Aug 27 '16 at 23:23
  • AlternativeTransport Great! :) – Fiksdal Aug 27 '16 at 23:24
  • 1
    Hello @AlternativeTransport and welcome to Linguistics.SE! Great to see you here. – Be Brave Be Like Ukraine Aug 28 '16 at 04:49
  • 1
    @AlternativeTransport Very interesting, could you post a link to your code and sources of words? – Matthias Schreiber Aug 29 '16 at 07:07
  • @Matthias Schreiber, well, the original intention was to add some missing languages to Tyschenko's work (Luxembourgish, Bavarian, Alemannic, Turkic languages, Semitic languages), so I was trying to get Tyschenko's exact method and word list. Right now I am fiddling around with different methods and word lists. – Alternative Transport Aug 29 '16 at 17:01
  • 1
    On methods: I don't really think that Levenshtein distance does lexical distance between languages justice. Let's say you determine the lexical distance from PHOTOGRAPH to FOTOGRAF. According to Levenshtein the lexical distance is 4: 2 times P --> F and 2 times removing H. But you could argue that since it is the same replacement and change, it could be 2, or even only 1 lexical/linguistic distance. A different 10-letter word with 4 different changes is much harder to understand when reading. Or with umlauts (ÜÖÄÅÜ): should AE --> Ä really count as 2 LD? – Alternative Transport Aug 29 '16 at 17:16
  • On word lists: there are lots of synonyms. Should belch, throw up, puke or vomit be on the English Swadesh list? Should braken, overgeven, kotsen, spugen or blazen be included for Dutch? Erbrechen, speiben, kotzen or übergeben for German? What do I do with cognates whose meanings have changed: Meer and Sea, See and Lake? The lexical distance between אָזערע (ozere) and sea? – Alternative Transport Aug 29 '16 at 17:24
  • 1
    So, as for code and word sources, I have nothing finished yet, but I intend to share it so that hopefully people working on other languages can DIY their own lexical distance. – Alternative Transport Aug 29 '16 at 17:26
  • Well, I would really like that. I think I would lose track of it, so can you write me an email at mat@boar.bar? Would be great! :) – Matthias Schreiber Aug 30 '16 at 06:08
  • This is very fascinating work and a further project to create a similar worldwide map could easily win grants if the application was well formulated. I hope you and other linguists manage to continue this work and come up with such a map. Regarding the program to use, R has several packages for calculating lexical distances (e.g. https://cran.r-project.org/web/packages/stringdist/index.html). – Mikko Mar 12 '17 at 08:54
  • Thank you for the tips and the compliment. This visualization is something I am doing in my spare time; it would be nice to have a team and funding. Is there a package in R that can find the smallest edit distance but with edit costs weighted differently, e.g. č to cz costs only 0.2 LD, th to d costs 0.4 LD and p to k costs 1.2 LD, by looking up a reference matrix with all possible replacements? Plus, I am looking to count an edit only once for a word and not again in other words in a list. – Alternative Transport Mar 13 '17 at 08:29
  • This map (at least the Romance part) feels very wrong. I'm a native French speaker, and I live in a Catalan-speaking place. The 2 languages are very close: I can read the news and understand at least the general meaning. The Wikipedia page about the language says that it's closer to French than Spanish. Also, a sister language of Catalan is Occitan, and it doesn't appear that way from the map. Also, Catalan being marked as the closest language to Spanish is plain wrong. Also, French being closer to Italian than to Catalan or even Spanish feels wrong as well; etc. etc. – Boiethios Nov 27 '23 at 09:55
  • Would be terrific to have the same map but with last common ancestor present in the graph (on background?), like Proto-Slavic, Latin, Proto-Germanic etc, to visualise degree of divergence from source – x3qt Jan 11 '24 at 14:46
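
The weighting idea raised in the comments above (cheap replacements such as č to cz, costlier ones such as p to k) can be sketched as a generalised edit distance with a lookup table of replacement costs. This is an illustrative sketch, not an existing R or Python package API: the function name `weighted_distance` and the `RULES` table are hypothetical, and the three costs are simply the examples quoted in the comment. Rules are directional, so a symmetric measure would need each rule listed both ways. Counting a repeated edit only once across a whole word list is not covered here.

```python
def weighted_distance(a, b, rules, default_cost=1.0):
    """Edit distance with custom costs for listed (source, target) replacements.

    rules maps (source, target) substrings to a cost; both strings must be
    non-empty. Any edit not listed falls back to default_cost per operation.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            best = dp[i][j]
            if i:                                   # delete a[i-1]
                best = min(best, dp[i - 1][j] + default_cost)
            if j:                                   # insert b[j-1]
                best = min(best, dp[i][j - 1] + default_cost)
            if i and j:                             # match or single-letter swap
                x, y = a[i - 1], b[j - 1]
                step = 0.0 if x == y else rules.get((x, y), default_cost)
                best = min(best, dp[i - 1][j - 1] + step)
            for (src, dst), cost in rules.items():  # multi-letter replacements
                ls, ld = len(src), len(dst)
                if i >= ls and j >= ld and a[i - ls:i] == src and b[j - ld:j] == dst:
                    best = min(best, dp[i - ls][j - ld] + cost)
            dp[i][j] = best
    return dp[n][m]


# Costs taken from the comment above; the table itself is hypothetical.
RULES = {("č", "cz"): 0.2, ("th", "d"): 0.4, ("p", "k"): 1.2}

if __name__ == "__main__":
    print(weighted_distance("thing", "ding", RULES))
    print(weighted_distance("čas", "czas", RULES))
    print(weighted_distance("photograph", "fotograf", {}))
```

With an empty rule table the function reduces to plain Levenshtein distance, which makes it easy to compare weighted and unweighted scores on the same word list.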
13

As noted in this answer, Prof. Tyshchenko's work primarily targeted languages spoken in Europe; hence most of them belong to the Indo-European family, the only exceptions being Basque, Finnish, Estonian, and Hungarian (labelled as Madyarska). Even the Celtic languages form a mini-group of four.

Linguistic Distances measured
Image courtesy of. I apologize that there's no English version of the map with numbers.

Although I didn't find any direct mention of why it does not cover other language families (LF), I think we may assume that the Linguistic Distances (LD) between languages belonging to different LFs may be too hard to measure. This is probably because their linguistic tools differ too much (mind the vowels, consonants, lexical tones, intonation patterns, and also a whole set of syntactic tools).

If you check the Wikipedia article on mutually intelligible languages, you may notice that most language pairs with a short LD belong to the same LF.

Also, note that Prof. Tyshchenko's work (link; sorry that it's in Ukrainian only) is not independent research. Rather, it is the result of long, hard work by a big group of linguists who elaborated various methodologies for measuring the LD. And it resulted only in a list of measured LDs between the languages spoken in Europe.


So I think that the suggested methodology can't be automatically applied to all the world's languages; hence there's probably no worldwide map of the kind you're asking for.

Tristan
Be Brave Be Like Ukraine
  • Yeah, you would probably need people with in-depth knowledge of each language group to do the research. Since these researchers knew Indo-European languages, they were equipped to work on that. I agree that it's unlikely that something like this exists worldwide. – Fiksdal Apr 22 '16 at 19:09
  • 1
    I would also be glad to see further elaboration of this work. If such research covering a worldwide list of languages appears sometime, I would also be very curious to look at it (and stand corrected on my current assumption). – Be Brave Be Like Ukraine Apr 22 '16 at 19:26
  • 1
    Yes, it would be very cool if it existed. I'm sure it could be feasible to create it if the right researchers and experts got together. – Fiksdal Apr 22 '16 at 19:28
  • So does Tyshchenko's article really talk about 'linguistic distance'? Or about 'lexical distance'? – Gaston Ümlaut Apr 24 '16 at 00:28
  • @GastonÜmlaut, I think, it's about Linguistic Distance, since it's a property of a pair of Languages. Lexical Distance seems to be a property of a pair of Lexemes. – Be Brave Be Like Ukraine Apr 24 '16 at 01:43
  • Ok, but every other source I've looked at that discusses the Tyschenko article (not sure if any of them were able to read the original) uses the term 'lexical distance'. And it does seem that Tyschenko's work is based on a comparison of lexical inventories, so there's no morphology, syntax etc. Sometimes this is called 'lexical similarity'. – Gaston Ümlaut Apr 25 '16 at 00:07
  • While I doubt that you could meaningfully construct a map for all worldwide languages, I would be fascinated to see a similar chart covering all the Indo-European languages. It's a pity those charts don't show any Indic languages because they are equally fascinating. – CJ Dennis Apr 25 '16 at 02:58
  • It's obvious that we need a new classification for Slavic languages, instead of the classical one, where Serbo-Croatian is in the Southern group just because it's geographically in the South (science, yeah?) and where Central ones are in the Eastern group just because Russian diverged later from the Center (politics, yeah?). It seems that, based on this calculation, the classification of Slavic languages should look as follows: – x3qt Jan 30 '23 at 14:47
  • slavic
    • central (Ruthenian †)
      • north-central (Belarusan)
      • south-central (Ukrainian)
    • western
      • north-western (Polish)
      • central-western (Czech-Slovak)
      • south-western (Serbo-Croatian)
    • southern (Church Slavonic †)
      • southern (Bulgarian)
    • eastern (Muscovite †)
      • central-eastern (Central Russian)
      • south-eastern (Southern Russian)
    • northern (Novgorodian †)
      • assimilated into north-eastern (Northern Russian)
  • – x3qt Jan 30 '23 at 14:47