TM Editor Guidelines, Pt 1 (from 84000)

dirk · January 30, 2020, 7:38am

Instructions for Aligning TMs from Pre-Segmented Text Files Using InterText:
TM Standards
General Principle:
1. Placing Segment Breaks According to Inflected Verbs
2. Segmenting from the Perspective of the Tibetan’s Own Grammar
Editing the Tibetan Segmentation:
1. Pre-segmentation Performed by the Pybo-Script
2. Where to Merge (Correcting Breaks Made by the Script)
3. Where to Break (Making Additional Breaks Missed by the Script)
Editing the English Segmentation:
1. Changing the Sentence Order in the English
2. Separating Compounded English Segments
3. Punctuation
4. Conjunctions
5. Verses
6. Words or Phrases Omitted or Added within the English Translation
Flagging Problematic Segments:
1. Marking Errors
2. Alternative Sources
Judgment Calls
Cheat Sheet

1. Instructions for Aligning TMs from Pre-Segmented Text Files Using InterText:

To segment the texts we are using a convenient open source application called InterText. You may download InterText here; it is very lightweight and can be run locally on Windows, Mac OS, or Linux.

I have made the following screencast that should show you everything you need to know:

Additionally, there is an online PDF guide for using InterText. The guide contains a lot of documentation that is not necessary to read through; to simply work from the app, you should only need to read part II, chapters 7–9, here.

For 84000’s TM project, we will provide you with the two .txt files that you will be aligning from InterText. The two texts have been prepared with a script (for anyone applying this methodology to another TM project, documentation for these scripts may be found here).

As mentioned in the tutorial, when you upload the .txt files use the following dialog settings: for the English, set the “Paragraphs” to be separated by “line breaks” and the “Sentences” to be “automatically segment text using profile: default”. For the Tibetan, set both “Paragraphs” and “Sentences” to be separated by “line breaks”.

Once you have become familiar with the interface, please read through and use the following guidelines while you are editing the alignment of the texts. These standards are summarized in a cheat sheet at the end.

2. TM Standards

The following is a set of recommended standards for segmenting translation memories according to what can loosely be understood as the “sentence” or “complete thought” found in the source Tibetan. The examples here all use English as the target language, although since these standards focus on the Tibetan grammar, it is hoped that this methodology may be adapted to be paired with any target language.

When creating the segmentation, it is expected that there will be some degree of subjectivity on the part of the TM editor in defining the length, start, and end of each segment. After surveying many different scenarios, source genres, and translation styles, we have determined that it is necessary to leave these rules somewhat flexible. If the rules are too rigidly defined, there will inevitably be scenarios where too many of the TM segments will be too long to be of any use, and this is particularly the case with classical Tibetan texts that notoriously use frequent run-on phrases. However, if we can create some general principles and guidelines that can be agreed upon, then our TM resources will be much more useful, both for translators retrieving their own past translations for consistency and recall, and for allowing TMs to be archived and shared between translators.

i General Principle:

A “segment” of the text constitutes what can be understood as one complete thought. In principle, it should be the most basic phrase-level unit of text that can be correctly understood without relying on any grammatical modifiers that may happen to precede or follow the segment. A segment should include at least one primary clause containing (A) a subject, which may be actually stated or just implied, (B) a verb, and (C) any grammatical adjuncts connected with that verb (objects, prepositions, participles, dependent clauses, etc.). A segment may necessarily contain multiple clauses if those clauses are dependent upon another main clause to be correctly read. The segment should thus be the smallest span of text that the translator will conceptually translate as one unit. Here is a simple example:

བྲམ་ཟེའི་ ཁྱེའུ་ ཉེ་ རྒྱལ་ དབེན་པ་ ལ་ དགའ་ བས་ གནས་ མལ་ དགོན་པར་ སོང་ སྟེ །
As the young brahmin Upatiṣya enjoyed solitude, he had gone to live in the forest,

A good way of thinking about segments in terms of length is to read the passage out loud and see where one would naturally punctuate with a pause before going on to the next statement. These pauses will likely be where we want to break up the segments. Abstractly, this will be more or less equivalent to an English sentence; however, the segmentation needs to be done from the perspective of the Tibetan grammar itself and not from its English translation. Tibetan only defines a full stop grammatically with the རྫོགས་ཚིག་ or completion particle (e.g., -འོ་, བོ་, སོ་ etc…). Unfortunately these particles are too few and far between to make useful segmentation, so we need to determine some additional breaking points that will break passages of one or more clauses into eloquent and suitable segments.

Placing Segment Breaks According to Inflected Verbs:

Because the typical grammatical order in Tibetan is subject, object, verb, a single clause always ends in an inflected verb (inflected into the past, present, future, etc.). These verbs need to be the focal point for determining when a break should and should not occur between any two clauses. Therefore the process for segmenting the Tibetan generally involves:

1) reading through the text and stopping to inspect each inflected verb,
2) determining if the clause governed by that verb along with all of its own grammatical adjuncts can stand on its own and doesn’t depend on the clause following it to determine its meaning, or vice versa.

If the answer to 2) is yes, then a break should be added after that clause. Note that it will almost always include an additional particle following the inflected verb whether it be a case particle (གིས་, གི་, ནས་, ན་, -ས་, etc…) or non-case particle(s) (ཡང་, སྟེ་, དེ་, ན་ཡང་ etc…) which should also be included in the segment. This final particle will also need to be considered, as it will be an important factor for determining the relationship between the preceding clause and the one following it.
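The two-step procedure above can be sketched in code. This is only an illustration under stated assumptions: the `(text, tag)` token format, the tag names `VERB` and `PART`, and the `can_stand_alone` callback (which stands in for the editor's own judgment in step 2) are all hypothetical and are not taken from any actual script.

```python
from typing import Callable, List, Tuple

# Hypothetical token format: (text, tag). "VERB" marks an inflected verb and
# "PART" a particle; these tag names are assumptions made for this sketch.
Token = Tuple[str, str]


def segment(tokens: List[Token],
            can_stand_alone: Callable[[List[Token]], bool]) -> List[List[Token]]:
    """Stop at each inflected verb and break there if the clause stands alone."""
    segments: List[List[Token]] = []
    current: List[Token] = []
    i = 0
    while i < len(tokens):
        current.append(tokens[i])
        if tokens[i][1] == "VERB":
            # Keep any particles that directly follow the verb in the segment.
            while i + 1 < len(tokens) and tokens[i + 1][1] == "PART":
                i += 1
                current.append(tokens[i])
            # Step 2: the editor's judgment, represented here by a callback.
            if can_stand_alone(current):
                segments.append(current)
                current = []
        i += 1
    if current:
        segments.append(current)  # leftover text that did not end in a break
    return segments
```

A callback that always returns `True` breaks after every inflected verb and its particles; one that always returns `False` leaves the whole passage as a single segment. Real editing sits between the two.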

I avoid calling these segments “independent clauses” because the presence of this final particle following the inflected verb would actually, in almost all cases, make the clause a dependent one if the Tibetan particles were forced into the parameters of English grammar. But in principle, any one segment should be able to stand on its own; the segment should include all the essential adverbs, locatives, and other adjuncts that are directly associated with the action of the verb.

Here are a few examples of segments created using this methodology:

དེ་ནས་ བྱང་ཆུབ་ སེམས་དཔའ་ འཕགས་པ་ སྤྱན་རས་ གཟིགས་ དབང་ཕྱུག་ དང་ །_ བྱང་ཆུབ་ སེམས་དཔའ་ ལག་ ན་ རྡོ་རྗེ་ ཏིང་ངེ་འཛིན་ དེ་ལས་ ལངས་ ནས་
Then, the bodhisattvas Noble Avalokiteśvara and Vajrapāṇi emerged from their state of concentration, and

བཅོམ་ལྡན་ འདས་ ག་ལ་བ་ དེར་ སོང་ སྟེ་
went to where the Blessed One was staying.

ཕྱིན་པ་ དང་ །_ ལན་ གསུམ་ དུ་ བསྐོར་བ་ བྱས་ ཏེ
They approached, circumambulated him three times, and

བཅོམ་ལྡན་འདས་ ལ་ འདི་སྐད་ ཅེས་ གསོལ་ ཏོ །_།
said to him,

བཅོམ་ལྡན་འདས་ དེ་བཞིན་ གཤེགས་པ་ རྣམས་ ཀྱི་ ཐབས་མཁས་པ་ དང་སེམས་ ཅན་ རྣམས་ ཡོངས་སུ་ སྨིན་པ ར་ བགྱི་བ་ ནི་ མང་ ངོ་ །_།
“Blessed One, the Tathāgata’s skillful means and methods that bring beings to spiritual maturity are many.

བདེ་བ ར་ གཤེགས་པ་ མང་ ངོ་ །_།
Sugata, they are many indeed.

བཅོམ་ལྡན་འདས་ དེ་བཞིན་ གཤེགས་པ འི་ གཟི་བརྗིད་ དང་ རྫུ་འཕྲུལ་ གྱི་ མཐུ ས་ བྱང་ཆུབ་ སེམས་དཔའ་ སེམས་དཔའ་ ཆེན་པོ་ དང་ །_ ཉན་ཐོས་ ཆེན་པོ་ དང་ །_ ལྷ་ དང་ །_ ཀླུ་ དང་ །_ གནོད་སྦྱིན་ དང་ །_ དྲི་ཟ་ དང་ །_ ལྷ་མ་ ཡིན་ དང་ །_ ནམ་མཁའ་ ལྡིང་ [146b]དང་ །_ མི འམ་ ཅི་ དང་ །_ ལྟོ་འཕྱེ་ ཆེན་པོ་ དང་ །_ རྒྱལ་པོ་ དང་ །_ བློན་པོ་ དང་ །_ བྲམ་ཟེ་ དང་ །_ ཁྱིམ་བདག་ དང་ །_ དགེ་སློང་ དང་ །_ དགེ་སློང་ མ་ དང་ །_ དགེ་བསྙེན་ དང་ །_ དགེ་བསྙེན་མ་ རྣམས་ མང་ དུ་ འདུས་ སོ །_།
Blessed One, the bodhisattva mahāsattvas, the great śrāvakas, gods, nāgas, yakṣas, gandharvas, asuras, garuḍas, [146.b] kinnaras, mahoragas, kings, ministers, brahmins, householders, monks, nuns, and male and female lay vow holders have gathered here in great numbers through the strength of the Tathāgata’s majesty and supernatural powers.”

You may notice with these examples that there are some gray areas when determining whether an adjectival clause is considered to be an adjunct to an adjacent clause. In such cases you should use your own judgment as to when a clause should be split. The length of the clause should also be taken into consideration; an exceedingly long TM segment will be less useful as a resource. However, when a long list of nouns is the subject or object of a verb, as in the last segment in the example above, then these need to be joined with their governing verb for the segment to be correctly understood, even if it is quite long.

Note that in almost all cases any final particles immediately following the last inflected verb should be included at the end of the segment. Also, if the segment ends on a double shad, “།_།”, the break should always occur after the second and final shad, as seen in the examples above. This follows the Tibetan rules of grammar, and we should never see a shad occurring at the beginning of a segment.
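The shad rule lends itself to a quick automated check after editing. The sketch below is a hypothetical helper, not part of InterText or of the pre-segmentation script; it assumes segments are plain strings and that “_” stands for a space in the pre-segmented text, as in the examples above.

```python
SHAD = "\u0f0d"  # the Tibetan shad mark


def starts_with_shad(segment: str) -> bool:
    """True if the first real character of the segment is a shad."""
    # Assumption: "_" is a space placeholder, so it is skipped together with
    # ordinary whitespace before looking at the first character.
    return segment.lstrip(" _\t").startswith(SHAD)


def misplaced_shads(segments):
    """Return the indexes of segments that begin with a shad."""
    return [i for i, seg in enumerate(segments) if starts_with_shad(seg)]
```

Any index returned points at a segment whose opening shad belongs at the end of the segment before it.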

Segmenting from the Perspective of the Tibetan’s Own Grammar:

As mentioned, the segmentation of the TMs should be governed by the Tibetan grammar rather than the punctuation and grammar found in its English translation. This is because the TMs will be recalled in future translation projects and we want to make them universally applicable to any new translation project. If the Tibetan is segmented like this in a consistent way, then when a new text to be translated on a CAT platform like OmegaT is run through the script and segmented following this same methodology, the TM data will yield the greatest number of matches.
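A toy illustration of why consistent segmentation matters for recall: if a TM is treated as a simple lookup from Tibetan segment to English segment, a new text only finds matches when its breaks fall in the same places. This is a deliberately simplified sketch; `exact_matches` is a hypothetical helper, and real CAT tools also offer fuzzy matching, which is ignored here.

```python
def exact_matches(tm, new_segments):
    """Return the new segments that have an identical source segment in the TM."""
    return {seg: tm[seg] for seg in new_segments if seg in tm}


# Two segments taken from the examples above.
seg1 = "བཅོམ་ལྡན་ འདས་ ག་ལ་བ་ དེར་ སོང་ སྟེ་"
seg2 = "ཕྱིན་པ་ དང་ །_ ལན་ གསུམ་ དུ་ བསྐོར་བ་ བྱས་ ཏེ"
tm = {
    seg1: "went to where the Blessed One was staying.",
    seg2: "They approached, circumambulated him three times, and",
}

same_breaks = [seg1, seg2]       # new text segmented the same way
merged = [seg1 + " " + seg2]     # same words, but no break between clauses

print(len(exact_matches(tm, same_breaks)))  # 2
print(len(exact_matches(tm, merged)))       # 0
```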

Therefore, the process of editing the segments should disregard the periods, phrasing or other grammatical aspects found in the English translation. Of course, the English translation will often naturally match up with the Tibetan segmentation, but we want to avoid taking too many cues from the English when determining when and when not to break. That being said, reading the English will be helpful for understanding the Tibetan text and finding where its own “complete thoughts” start and end. Especially if your Tibetan reading skills are still developing, then following the Tibetan by reading along with the English will be essential for the TM editing process.

Giving preference to the Tibetan grammar does present an obvious challenge when the English translation has compounded two Tibetan segments together and intermingled the words in the English, which would prevent you from being able to make a clean break. In this situation there is a simple solution involving duplicating and bracketing the English; it is described in detail along with some examples in section 2.iii.b below, on editing the English segmentation.

ii Editing the Tibetan Segmentation:

As mentioned, the Tibetan and English .txt files that you will be aligning will first be pre-segmented with a script. For the Tibetan, a script is used that first word-tokenizes all the words and particles according to parts of speech and then creates segments with single line breaks after certain inflected verbs, depending on the types of particles following them.

Ideally we want the script to do 50–70% of the work in terms of creating the breaks. A few of these breaks will need to be corrected and merged again, either because the script erroneously tokenized a verb or particle, or because it broke at a conjunction that really needs to be joined with its following clause to be understood as a complete thought.

Then a few additional breaks will need to be added for conjunctions that are not broken by default, but do in fact mark the boundary of a complete thought.

Pre-segmentation Performed by the Pybo-Script:

We will continue to update the script as we go, but generally the script will identify the inflected verbs (i.e., not the nominalized ones containing the markers +པ་/བ་/པར་ etc…), and create a break under the following conditions:

A segment break will be made after an inflected verb followed by a completion particle (རྫོགས་ཚིག) followed by a shad ། ending.

ཐམས་ཅད་ ཀྱང་ ཆོས་ ཉན་པ ར་ འདོད་པ ར་ གྱུར་ ཏོ །_།

A segment break will be made after an inflected verb followed by a source case ནས་ or ལས་ particle.

དེ་ནས་ དེ འི་ ཚེ་ ན་ ས་ ཆེ ར་ གཡོས་པ ར་ གྱུར་ ནས་

A segment break will be made after an inflected verb followed by a continuative ལྷག་བཅས་ particle, སྟེ་, ཏེ་, or དེ་, (note that the last one may be a demonstrative pronoun that has been misidentified by the script).

དེ་ནས་ བཅོམ་ལྡན་འདས་ སེང་གེ འི་ ཁྲི་ དེ་ཉིད་ ལ་ བཞུགས་ ཏེ་

A segment break will be made after an inflected verb followed by a ན་ case particle that will typically mark a conditional/temporal clause.

མི་ ཅིག་ ནམ་མཁའ་ ལ་ མེ་ཏོག་ གཏོར་ ན་

A segment break will be made after an inflected verb that has a ། but no other particle following it.

ཡོངས་སུ་ མྱ་ངན་ ལས་ འདའ་བ་ ཡང་ སྟོན །_
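The five break conditions above can be sketched as a small rule-based pass over tagged tokens. This is not the Pybo script itself, only a minimal approximation under stated assumptions: tokens arrive as `(text, tag)` pairs, the tag `VERB` (a hypothetical label) is given only to inflected verbs, “_” stands for a space, and the completion-particle set is a partial list.

```python
TSHEG = "\u0f0b"  # syllable delimiter
SHAD = "\u0f0d"   # clause/sentence mark

# Particles named in the conditions above (tsheg stripped for comparison).
BREAK_PARTICLES = {"ནས", "ལས", "སྟེ", "ཏེ", "དེ", "ན"}
# Partial, illustrative list of completion particles.
COMPLETION_PARTICLES = {"འོ", "བོ", "སོ", "ཏོ", "ངོ", "རོ", "མོ", "དོ"}


def bare(text):
    """Strip tsheg, spaces and the "_" space placeholder from a token."""
    return text.strip(TSHEG + "_ ")


def is_shad(text):
    """True for tokens made only of shad marks (single or double shad)."""
    core = text.replace("_", "").replace(" ", "")
    return core != "" and set(core) == {SHAD}


def presegment(tokens):
    """tokens: list of (text, tag); the assumed tag "VERB" marks an inflected
    verb. Returns a list of space-joined segments."""
    segments, current = [], []
    n = len(tokens)
    i = 0
    while i < n:
        current.append(tokens[i][0])
        if tokens[i][1] == "VERB":
            nxt = tokens[i + 1][0] if i + 1 < n else ""
            after = tokens[i + 2][0] if i + 2 < n else ""
            if bare(nxt) in BREAK_PARTICLES:
                end = i + 1   # source-case, continuative or ན་ particle
            elif bare(nxt) in COMPLETION_PARTICLES and is_shad(after):
                end = i + 1   # completion particle followed by a shad
            elif is_shad(nxt):
                end = i       # shad directly after the verb
            else:
                end = None    # no break condition met
            if end is not None:
                # Keep trailing shads with the segment they close.
                while end + 1 < n and is_shad(tokens[end + 1][0]):
                    end += 1
                current.extend(text for text, _ in tokens[i + 1:end + 1])
                segments.append(" ".join(current))
                current = []
                i = end
        i += 1
    if current:
        segments.append(" ".join(current))
    return segments
```

Because the rules look only at the token after the verb, a demonstrative དེ་ that happens to follow a verb will trigger a break, which is the kind of error the next section asks you to correct by hand.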

Where to Merge (Correcting Breaks Made by the Script):

Just because the script makes a break does not mean it should be unequivocally accepted. You should still inspect each inflected verb to see whether that break makes sense. You can use your own judgment and common sense, but let’s examine some of the scenarios where you would want to correct the script by merging a break in the Tibetan.

(Note, to merge two chunks of text into a single TM segment you simply need to place them into the same cell from InterText and they will be merged in the exported TM. You don’t need to merge the units themselves.)
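As a rough mental model of that behavior, each aligned row can be thought of as a pair of cells, each holding one or more chunks that are joined on export. The sketch below is an assumption-laden illustration only; it is not InterText's export code or file format, and `export_segments` is a hypothetical name.

```python
def export_segments(rows):
    """rows: list of (tibetan_chunks, english_chunks), one tuple per aligned
    pair of cells. Chunks sharing a cell are joined into one TM segment."""
    return [(" ".join(bo), " ".join(en)) for bo, en in rows]
```

Two Tibetan chunks placed in the same cell therefore come out as a single source segment, paired with whatever English sits in the opposite cell.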

In the following examples an asterisk “*” will represent a break created by the script:

Verb or Particle Misidentified by the Script:

As mentioned, the script may misidentify some words, especially homographs (two words that share the same spelling) or verbs that are actually being used as nouns. Please look out for such occurrences and use your common sense to identify and merge breaks that shouldn’t be there:

ཡང་དག་པར་ ལྟ་ ནས་*
From correct view…

The script identified ལྟ་ as the verb, “to look,” but it is actually the noun, “view” (as in someone’s conceptual or ideological perspective).

After ལས་/ ནས་ Particle that Marks Simultaneous Actions, Reason, or Otherwise Connects One Clause to Another in a Significant Way:

Usually when the ablative particle ལས་/ནས་ is placed after an inflected verb, it will be a simple conjunction indicating a sequential clause following it and be translated into English as “and then” or just “and”. Therefore, we can usually break each of these clauses into independent segments. However, sometimes the ནས་ may signify that the actions of the two clauses it joins are simultaneous, which will typically be correlated in the English with a translation like “while”:

རྟ་པས་ རྟ་ བཞོན་ ནས་* མེ་ མདའ་ རྒྱབ་ སོང་
The horseman fired the gun while riding on the horse.

An incorrect break after inflected verb བཞོན་ ནས་

Also, the ནས་ may indicate a stronger correlation between the preceding and following clauses. For instance, the prior clause may indicate the reason or place of the following one, or for some other reason the prior clause may be a necessary adjunct of the following one. Consider the following segments:

སེམས་ སྡུག་ ནས་* བུ་ ངུས །
Distraught, the boy cried.

བླ་མ་ དྲན་ ནས་* ཁོང་ བཤུམས །
Remembering her guru, she wept.

ལྷ འི་ བུ་ དག་ གང་ སེམས་ཅན་ ལ་ལ ས་ སངས་རྒྱས་ བཅོམ་ལྡན་འདས་ མྱ་ངན་ ལས་ འདས་ ནས་* ལོ་ བརྒྱ་ སྟོང་ ངམ །_ བསྐལ་པ་ བྱེ་བ་ ལོན་ ཡང་
Gods, even though one thousand years, an eon, or even ten million eons may have elapsed since the Bhagavān Buddha entered parinirvāṇa,

In all these cases there will certainly be some gray areas. You should use your own common sense judgment for determining when the prior clause can be understood standing on its own, or needs to be merged as an adjunct to the following one.

After a ལྷག་བཅས་ Particle (སྟེ་, ཏེ་, or དེ་):

Usually the ལྷག་བཅས་ particle will precede a summary, reiteration or elaboration of the preceding clause or sentence. It is usually suitable to break here, and so that is why the script has been set up to break on clauses followed by a ལྷག་བཅས་, but again, use your own judgement for determining cases when the two segments should be merged:

དེ་འོད་དེས་རེག་སྟེ་*བསྐུལ་ནས།
The light touched and inspired him

Here for example the two verbs “རེག་” and “བསྐུལ་” share the direct object and agent. They are also two short actions that are sensibly part of the same complete thought.

After a ན་ Particle:

Although the script will break after a ན་ case particle following an inflected verb, this is one scenario that needs to be confirmed with some careful discernment. Usually the ན་ marks a conditional or temporal clause (“when…” or “if…”), so technically this prior clause is an adjunct that describes the condition, time, or cause of the following clause. However, we are suggesting a break be placed here because in most cases the conditional or temporal clause can in fact stand on its own and doesn’t require the main clause to be understood. Again, here you should read the clause to yourself and use your common sense to judge whether or not the break should be there. Here are some examples of acceptable breaks performed by the script after ན་ particles:

རྒྱལ་བའི་ ཆོས་ རྣམས་ སྨོན་པར་ བྱེད་པའི་ མདོ་སྡེ་ འདི །_། གཞན་ ལ་ ཡུད་ ཙམ་ གཅིག་ ཅིག་ སྟོན་པར་ བྱེད་ གྱུར་ ན།
“Yet if for just a brief moment, Another teaches others this sūtra, That points to the Dharma of the victorious ones,

དེ་ ནི་ དེ་ བས་ བསོད་ནམས་ འབྲས་བུ་ ཁྱད་ པར་ འཕགས །
The fruit of his merit will exceed the former.

གལ་ཏེ་ ཆར་ འབེབས་པར་ མི་ བྱེད་ ན །
However, if the nāgas do not send rain,

མཛེས་ ཐེབས་པར་ འགྱུར་ རོ །
leprosy will break out.

བཅོམ་ལྡན་འདས་ ཡུལ་ སྤོང་ བྱེད་ ན །
When the Bhagavān wandered in the land of Vṛji,

ལྗོངས་ རྒྱུ་ ཞིང་ གཤེགས་པ་ དང་ ། གྲོང་ སྤྱིལ་བུ་ཅན་ དུ་ བྱོན་ ཏེ །
he arrived at the village of Kuṭigrāmaka and

These pairs of segments can clearly be understood on their own. It is also fine, as in the final segment above, for the subject to be only implied in the second clause, since this is common to most Tibetan clauses and isn’t necessary to understand the action of the clause.

However, in other cases, the two clauses will be more dependently linked and they should be merged together:

གྲུབ་ པར་ གྱུར་ ན །* སངས་རྒྱས་ ཉིད་ ཀྱང་ སྟེར་བར་ བྱེད་ དོ །
Being accomplished, she can grant even the state of awakening.

Here the first clause “གྲུབ་ པར་ གྱུར་ ན །” is not just a condition for the following clause; there is a sense that it is instrumental in the action of “སྟེར་བར་ བྱེད་”, being the means through which the granting is done. Therefore, the first clause needs to be included with the second one because it describes and modifies how that second action is done. This particular segment is again a gray area, and this call should be made according to your own judgment and with the full context of the passages that precede and follow.

ཡབ་ ཡུམ་ བདག་ ཅག་ བཏང་ ན་* ལེགས་ ཏེ་
Father, Mother, it is excellent that you let us go.

These two clauses need to be joined because the clause before the ན་ is the subject of the implied linking verb in the second clause coming after the ན་.

དེ་དག་ ངས་ ཆོས་ བསྟན་ ན་* ཤེས་ པར་ འགྱུར་ ནས །
They will understand the Dharma taught by me, and

Here the action of the first verb “བསྟན་ ” is the object of the verb “ཤེས་” in the second clause, so these two clauses should be kept merged into a single segment.

Sometimes the conditional statement may be so brief that it should be included in the following clause if it intuitively seems part of the same complete thought. For example:

དེ་ དག་ བྱས་ ན་* མྱ་ ངན་ མེད །
Then later you will have no regrets.

To make such judgment calls, it is good to ask yourself whether the segment is a useful piece of information. Exceedingly short TMs of just a few syllables certainly will not provide a translator with any useful information when they are recalled, especially here, where the translator has bent the meaning slightly with “དེ་ དག་ བྱས་ ན་ = Then later”. By itself, this doesn’t tell you anything; it needs to be given a bit more context to be a useful TM.
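One way to catch candidates for this kind of merge is to flag very short segments for review. The following is a minimal sketch, not part of the project's tooling: it assumes space-tokenized segments like those shown here, counts syllables by tsheg and shad, and uses an arbitrary `min_syllables` threshold that you would tune yourself. It only points at segments to look at; whether to merge remains your call.

```python
import re

TSHEG = "\u0f0b"  # syllable delimiter
SHAD = "\u0f0d"   # clause/sentence mark


def syllable_count(segment):
    """Approximate syllable count of a space-tokenized Tibetan segment."""
    # Remove token spaces, "_" placeholders and "*" break markers first, so an
    # affixed particle written as a separate token is not counted twice.
    compact = re.sub(r"[\s_*]+", "", segment)
    parts = re.split("[" + TSHEG + SHAD + "]+", compact)
    return sum(1 for p in parts if p)


def too_short(segments, min_syllables=5):
    """Indexes of segments to review; min_syllables is an arbitrary choice."""
    return [i for i, seg in enumerate(segments)
            if syllable_count(seg) < min_syllables]
```

With this counting, the four-syllable conditional clause in the example above would be flagged under the default threshold.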

Short Series of Actions:

Often a series of short actions defined by inflected verbs will be broken by default by the script. These should be joined, especially if they are the subject or object of another verb in another clause:

འཁོར་ གྱི་ དཀྱིལ་འཁོར་ དེ་དག་ ཆོས་ དང་ ལྡན་པ འི་ གཏམ་ གྱིས་ ཚིམ་པ ར་ བྱས །_ *བསྐུལ །_ *གཟེངས་བསྟོད །_ ཡང་དག་ པར་ དགའ་བ ར་ བྱས་ ཏེ །_
he pleased his surrounding retinue with a teaching on Dharma, and encouraged, uplifted, and complimented them.

The three verbs at the end of this segment are follow-ups to the verb “ཚིམ་པ ར་ བྱས །”. They should be merged, especially since they all share the same object, “འཁོར་ གྱི་ དཀྱིལ་འཁོར་ དེ་དག་”, stated in the first clause.

Final རྫོགས་ཚིག་, or Completion particles (-འོ།, མོ།, སོ།, etc…):

Since the རྫོགས་ཚིག་ completion particle is a full stop, this is the one case where we are 95% sure that a segment should end here. However, there are still some cases where you will want to change or readjust the breaks made by the script after the completion particle, for example:

དེ་དག་ ནི་ ཆོས་ མ་ བསྟན་ ན་ ནི་ ཤེས་ པར་ མི་ འགྱུར་ རོ་* སྙམ་ མོ།
He thought, “but they will not understand the Dharma if I do not teach it.”

“སྙམ་ མོ། / He thought,” is too short to be a useful TM, so the break after the first completion particle should be removed. However, if a verb of thought or speech is longer than just two syllables and contains adjoining subjects, adverbs, etc., then it is more likely that it should be set apart as its own segment.

…