Key Terms & Descriptions for Material Developers at Esukhia

This document introduces key terms for MD Objects (material development) at Esukhia.

MD OBJECTS

There are 3 types of MD objects (“things” that have “content” or “data” – while the content might change due to edits, these are otherwise stable objects, as opposed to “processes”):

  1. The Material
  2. The Frequency Lists
  3. The Corpuses

1. The Material

The Material is developed by the MD team. They take the CEFR goal and adapt it to Tibetan and Esukhia’s 6-step pedagogy. During development, feedback is provided based on frequency analysis – a process that uses the lists in combination with Dakje (to quickly find non-level vocab) and AntConc (to check for naturalness and cross-check for frequency).

However, the materials are not the same thing as the list; sometimes, the MD developers think a communicative goal (from CEFR) or a pedagogical goal (from the 6-steps) outweighs strict adherence to the lists. Or, a lesson or exercise might be an “MVP” version that has not yet taken frequency into account – that is, it has yet to go through proofreading or editing.

So, the inclusion of a word in a lesson’s material does not necessarily mean it is on the list, or that it appears in the larger corpus collection. (For example, just because a word is used in the A1 material, that doesn’t mean it’s on the A1 list; and just because a word is on the A1 list, that doesn’t guarantee it appears in the material – although that should be a goal, especially for extensive reading materials and other supplements to the main textbooks / missions.)

So, while the material takes the frequency lists into consideration, they are not perfectly reflected in each other. The A1 material and the A1 frequency list are two different things. Material developers should bear this in mind while developing level-appropriate material, being careful to edit as close to the level lists as possible (95%–100%), but knowing that one or two carefully considered words are okay to use (esp. for communicative goals and keeping things as natural as possible).
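
The 95%–100% target above can be checked mechanically. Below is a minimal sketch (not Esukhia tooling – the function name and sample words are illustrative) of measuring what share of a draft’s words appear on a level list:

```python
def level_coverage(words, level_list):
    """Return the fraction of words that appear on the level list."""
    if not words:
        return 1.0
    on_list = sum(1 for w in words if w in level_list)
    return on_list / len(words)

# Hypothetical A1 list and draft, using words mentioned in this document.
a1_list = {"རེད་", "མ་རེད་", "རྟག་པར་"}
draft = ["རེད་", "མ་རེད་", "རྟག་པར་", "རྟག་པ་"]

print(f"{level_coverage(draft, a1_list):.0%}")  # 75% – below the 95% target
```

A draft scoring below the target would then be edited, with the handful of off-list words kept only if they serve a communicative or naturalness goal.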

Esukhia materials currently come in 3 formats, accessible on 3 different platforms (although, all via one kind of software, a web browser): 1) Arapatsa (A2+, B1, B2; EdX); 2) Jongdeb (pre-A1; HTML); and 3) Mission Plans (A1 & A2; Google Docs).

2. The Frequency Lists

The frequency lists are lists of words that represent the general number and kind of words a learner should be able to use and understand at that level. They aren’t a strict list — a student doesn’t have to know each and every word there, and they might know some that aren’t there — but they are a guide, or a representation of the set of words that are frequent and useful at that level.

As mentioned above, the lists are not the material; they are primarily based on the corpuses Esukhia has collected over the years. For the early levels (A-series), this includes the ‘natural speech’ section of the Nanhai corpus (K-R-XXX) and the Children’s Speech Corpus. This is because speech, and oral language skills, are the foundational skills of language.

However, just as the lists do not use exactly the same words as the material, they are not exactly the same as the corpuses either. They are built from the corpuses, but they aren’t just a copy-paste of the top words across all corpora. They are a subsection, selected based on things like: a) their frequency in natural speech; b) their usefulness for the communicative goals of the level; c) their additional level-appropriate collocations or compounds; etc.
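
The first step of that selection – ranking corpus words by raw frequency before manual curation – can be sketched as follows. This is an illustrative draft pass only (the threshold and sample tokens are made up), not Esukhia’s actual list-building process:

```python
from collections import Counter

def candidate_list(corpus_tokens, min_count=2):
    """Draft a frequency-ranked candidate list from word-spaced corpus tokens.

    The real lists are then manually curated against the other criteria
    (communicative usefulness, level-appropriate collocations, etc.).
    """
    counts = Counter(corpus_tokens)
    return [(word, n) for word, n in counts.most_common() if n >= min_count]

# Hypothetical tokens from a word-spaced speech transcript.
tokens = ["རེད་", "རེད་", "རེད་", "མ་", "རེད་", "ཡིན་", "ཡིན་", "ད"]
print(candidate_list(tokens))  # [('རེད་', 4), ('ཡིན་', 2)]
```

Words below the count threshold drop out automatically; everything that survives still has to earn its place on the published list by hand.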

Because they are manually edited, they may count (and list) “words” that aren’t considered “words” by the corpuses. For example, a corpus may split མ་རེད་ད། into མ་ (negation) རེད་ (verb) ད (emphasis particle)། But the list considers this one word, a (negated & emphasized) version of the headword “རེད”. This helps us distinguish which versions of a headword a pre-A1 learner might use and understand (eg, only རེད་ and མ་རེད་) from those an A1 or A2 learner might (རེད་, མ་རེད་, རེད་བ་, མ་རེད་བ་, རེད་ད་, etc.).
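
One way to record this headword-versus-version relationship is a simple lookup table, as in the hypothetical sketch below (the mapping shown covers only the forms from the example above; a real table would be far larger):

```python
# Hypothetical variant table: each surface form maps to its headword,
# as the lists do for རེད་ and its negated / emphasized versions.
VARIANTS = {
    "རེད་": "རེད",
    "མ་རེད་": "རེད",
    "རེད་བ་": "རེད",
    "མ་རེད་བ་": "རེད",
    "རེད་ད་": "རེད",
}

def headword(form):
    """Map a surface form to its headword; unknown forms map to themselves."""
    return VARIANTS.get(form, form)

print(headword("མ་རེད་"))  # རེད
```

A table like this also makes the level distinction explicit: the pre-A1 subset would contain only རེད་ and མ་རེད་, while the A1/A2 subset would include the fuller set of versions.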

While a process – like Dakje – might not yet recognize this content, it’s still important to document it (for the developers, course guides, & learners that do, and the future processes that might). And, it’s important to know that a tool like Dakje doesn’t split words based on our frequency word lists, nor based on the corpus word lists, but on its own internal Python rules. It’s up to the material developers to cross-check the data.

For example, Dakje will always split རྟག་པར་ (one word) into two words: རྟག་པ + ར་, even if it is included on our frequency lists. Since རྟག་པར་ is on the A1 list but རྟག་པ་ isn’t, Dakje flags what looks like a non-level word even though the word actually used is level-appropriate. Dakje might also split a larger word (that is non-level) into its multiple parts (that are level-appropriate). Users will need to manually correct for this for the time being…

3. The Corpuses

The corpuses are large datasets of Tibetan words. There are two main corpuses: 1) The Nanhai Corpus, which has several subsections (some of which are speech, some of which are literary), and 2) The Children’s Speech Corpus. Both of these inform the frequency lists. However, the corpuses were word-spaced manually, by different teams of people, at different times.

As material developers, we cannot take the corpus data – or the frequency lists we generate from them – at their word. As an important part of creating lists and grading content, we must constantly cross-check these datasets to ensure we have accurate counts; to find multiple spellings, or multiple word-spacings; and the like. While Dakje is best-suited for cross-checking material, AntConc is best-suited for cross-checking specific words (eg, for how they are used naturally; for what words they are used with, in what contexts; for spelling variation or spacing variation; and so on).
