This table lists many language codes for data sets, but I can't find where the codes ze_zh and ze_en are defined.

None of the ISO code standards seem to list these pairs either.

Sir Cornflakes

1 Answer

I think this means that some of their data comes from a bilingual Chinese / English source. zh means Chinese and en means English, so it's reasonable that they invented ze to mean a bilingual data source including Chinese and English.

Correct me if I'm wrong, but it looks like you got this table from the OpenSubtitles corpus? http://opus.nlpl.eu/OpenSubtitles-v2018.php

That page points to the paper "OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles" by Lison and Tiedemann, published at LREC 2016. http://www.lrec-conf.org/proceedings/lrec2016/pdf/947_Paper.pdf

The paper doesn't use language codes, but it has this quote:

The dataset also includes bilingual Chinese-English subtitles, which are subtitles displaying two languages at once, one per line (Zhang et al., 2014). These bilingual subtitles are split in their two constituent languages during the conversion.
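To make the splitting step in that quote concrete, here is a minimal sketch of how a bilingual subtitle block might be separated into its two languages. This is not the OpenSubtitles pipeline, which the paper does not spell out at this level. The function name `split_bilingual`, the sample lines, and the rule "a line containing a CJK ideograph is Chinese, anything else is English" are all my own assumptions for illustration.

```python
import re

# Matches any character in the main CJK Unified Ideographs block.
# Assumption: this is a good enough test for "this line is Chinese".
CJK = re.compile(r"[\u4e00-\u9fff]")


def split_bilingual(lines):
    """Split bilingual subtitle lines into (zh_lines, en_lines).

    Hypothetical helper: a non-empty line with at least one CJK
    ideograph goes to the Chinese list, every other non-empty line
    goes to the English list. Blank lines are dropped.
    """
    zh, en = [], []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        (zh if CJK.search(text) else en).append(text)
    return zh, en


# Made-up sample: two languages displayed at once, one per line.
block = ["你好，世界。", "Hello, world.", "", "再见。", "Goodbye."]
zh_lines, en_lines = split_bilingual(block)
print(zh_lines)
print(en_lines)
```

If my guess about the codes is right, the two halves of such a split would be what the table labels ze_zh and ze_en, but that mapping is an assumption on my part.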

I don't know why they don't just label these as en and zh, but once you look into it more, I'm sure it will make sense.

Jetpack