Those words that can occur in attributive position and are not verbal nouns we define as adjectives; they are almost always multisyllablic. The paragon of adjectives are those that in -po, -mo, -pa, -ma, -can, -ldan, and -med. Etymologically -can, -ldan 'with', and -med 'without' are verbs, but when they occur in forms that are functioning nominally with the regards to syntax, they cannot be analyzed as verbs. So, we treat them together with the preceding syllable as an adjective. The formation of adjectives is a complex; which we intend to write about. In the mean time a search for words with this tag in the corpus will reveal some patterns. (Although words such as nag 'black', gsar 'new', and che 'big' are frequently treated like adjectives for pedagogical purposes, a single syllable in predicate position before verbal suffixes is a verb. These words may appear to occur attributively, but we see this as the formation of compounds. The compound blon-chen 'prime minister' contrasts beautifully with noun and adjective pair blon-po chen-po 'great minister', etc.)
We use this tag for adverbs that end in -cad, i.e. phyin-cad 'after', sṅon-cad 'before', man-cad 'below', yan-cad 'above', slan-cad 'after'. We also include phan-tshun 'mutually' in this category, for lack of a better place to put it. Words tagged with this tag include: ཡན་ཆད་, མན་ཆད་, ཕན་ཚུན་, ཕྱིན་ཅད་, ཚུན་, ཚུན་ཅད་, སྔོན་ཅད་, སྔོན་ཆད་, སླན་ཅད་, ཕན་ཆད་, མན་ཅད་, ཡན་, ཡན་ཅད་
This tag was created for the adverbs rab (tu) 'very' and śin (tu) 'very', because as uninflected stems they do not occur independently. Subsequently the tag has also been applied to ཅུང་ཟད་, ཡེ་, ཧ་ཅང་, ཆེས་, ནན་ཏ, and ཅུང་ཙམ་.
This tag is used for onomatopoeia and phonophors. Examples include ཁོར་ཁོར་ཡུག་, ནར་ནར་, ཙབ་ཙོབ་, རྩོག་རྩོག་, ཁྲེ་ལོག་ལོག་, གཅེར་རེ་, ཅེ་རེ་, ཏབ་ཏབ་པོ, ཏབ་ཏོབ་, ཏུར་ཏུར་པོ་, རིག་རིག་, ལིང་, and ཧྲིག་ཧྲིག་.
A small number of words occur clause initially and refer to the content of the previous clause. Such adverbs often begin with a demonstrative stem. These words are classed as 'proclausal adverbs'. In the following list they are presented together with their affixal suffixes, but in our tagging we divide off these suffixes in order to be consistent with their treatment elsewhere: de (nas) 'then', de (ste) 'thereafter', gal (te) 'if', ḥo (na) 'in that case', ḥon (te) 'nevertheless', yaṅ (na) 'alternatively'. Words tagged with this tag include: དེ་, ཡང་, གལ་, ཁྱད་པ, འོ་, འོན་, དེ་བ, ལ, འང་, and སྒོས་.
Temporal adverbs are those that occur in syntactic positions or have morphological structure that suggests they are nouns, that refer to time, and that are not followed by case markers. In our corpus so far the temporal adverbs are sṅon 'previously', da 'now', deṅ 'these days', mdaṅ 'yesterday', gdod 'at first', da-ruṅ 'still', phyi-ñin 'the next day', phyi-dro 'in the afternoon', and saṅ 'the next day'. There are also nouns that refer to time such as źag 'day' and gdugs 'noon', but these behave syntactically as nouns, for example by being suffixed with case markers. In the phrase saṅ gi gdugs la, where saṅ also appears to function as a noun, we have hesitatingly decided that for the time being it is best to tag saṅ as a temporal adverb. Words tagged with this tag include: ད་, སྔོན་, སྔ་, ད་དུང་, ནང་པ, ཕྱི་ཉིན་, ཁ་སང་, དོ་ནུབ་, གདོད་, ཉིན་, དེ་རིང་, དེང་, ནུབ་, མཚན་མོ་, ཅིག་ཅར་, གཟོད་, ཐོག་མ་, དབྱར་, དིང་སང་, མདང་, མཚན་, སང་, དགུན་, དེང་སང་, ནངས་པ, ཕྱི་, ཡུན་རིང་, སྲོད་, གདུགས་ཚོད་, གནངས་, གུང་, གློ་བུ, ཉིན་མཚན་, ད་ལོ, ནང་, ནངས་, ཕྱི་དྲོ་, མཚན་ཡལ་, འདང་, སླན་ཆད་, ཁར་རྩང་, ཁྲུགས་, གནའ་, གཟོད་མ་, ཉིན་ཞག་ཕྲུགས་, ད་ནང་, ད་རུང་, དོ་ནུབ, ནང་བ, ནངས་བ, ནམ་བཞིག་, ནུབ་མོ, ཕྱི་ཕྱི, མཚན་ཐོག་ཐག་, འཆར་ཁ་, འདང་གསུམ་, འཕྲོ་, རྒྱ་, སང་ཉིན་, སང་དགོང་, སང་ནང་པ, སྐྱ་རེངས་, སྔ་དྲོ་, སྔ་སོ, སྔ་སོ་, སྔོན, སྔོན་མ་.
The ablative case is the morpheme las when it follows a substantive.
The agentive case is marked by any of the allomorphs -kyis, -gis, -gyis, -s, and -yis when suffixed to a substantive.
The allative case is the clitic morpheme la when it occurs suffixed to substantives.
The associative case is the clitic morpheme daṅ when suffixed to a substantive.
The comparative case is either of the allomophs -bas or -pas suffixed to an substantive.
The elative case is the clitic morpheme -nas when suffixed to substantives.
The genitive case is any of the allomorphs -kyi, -gyi, -gi, -yi, or -ḥi, when suffixed to a substantive.
The locative case is the clitic morpheme -na when suffixed to a substantive. Note however that when -na appears after a verbal noun ending in -pa or -ba in the meaning 'if/when', we tag this -na as 'cv.loc'.
The case suffix -na-re which introduces direct speach.
The terminative case is any of the allomorphs -tu, -du, -ru, -su, -r when suffixed to substantives.
This tag was invented for the focus clitics ni and kyaṅ ~ yaṅ. The words currently tagged with this tag include: ནི་, ཡང་, ཀྱང་, འང་, ཅང་, and ཕྱིར་ཡང་. It is probably not correct to tag ཕྱིར་ཡང་ with this tag, but we must consider the matter further. The word lta when it is not tagged as a verb, or a relator noun, is also tagged as a focus clitic.
We invented this tag for the 'quotative clitic' with the allomorphs ces, źes, and śes. However, the forms ce-na, ces-pa, etc. reveal that more properly this should be called a 'quotative verb' and tagged with a tag that begins with 'v'. Nonetheless, it is now inconvenient to change our approach. The words tagged with this tag are: ཞེས་, ཅེས་, ཞེས་པ་, ཞེ་, ཅེས་པ་, ཤེ་, ཅེ་, ཅེས་བ་, ཞིས་ (a misspelling of ཞེས་), and ཞེས་བ.
The ablative converb is the morpheme las when it follows a verb.
The agentive coverb is marked by any of the allomorphs -kyis, -gis, -gyis, -s, and -yis when suffixed to a verb.
The allative converb is the clitic morpheme la when it occurs suffixed to verbs.
This tag is used to tag the converb -ta-re and its allomorphs (cf. Walter Wimon, 1967, 'The Tibetan Particle "re",' Bulletin of the School of Oriental and African Studies, 30.1, pp. 117-126).
The associative converb is the clitic morpheme daṅ when suffixed to a verbs. Note that this tag is particularly used with the polite imperative meaning of daṅ. One also sees daṅ in the meaning 'and' directly suffixed to morphological verb stems. In such cases we see daṅ as evidence that the preceding form is funtioning nominally and should be tagged [n.count], barring reasons to do otherwise.
The continuing converb, unknown in the early literature, is formally similar to the genitive followed by -n and functionally is similar to the imperfective converb in earlier strata of literature. The words tagged so far with this tag are: གིན་, ཀྱིན་, གྱིན་.
The elative converb is the clitic morpheme -nas when suffixed to verbs.
The final converb is any allomorph of the marker -ḥo, which is an explicit marker of a finitie verb in the indicative mood. Words tagged with this tag include: སོ, འོ, དོ, ངོ, ནོ, ཏོ, རོ, ལོ, གོ, བོ, མོ, and པོ.
The genitive converb is any of the allomorphs -kyi, -gyi, -gi, -yi, or -ḥi, when suffixed to a verb.
The imperative converb is the morpheme that explicitly marks the imperative or prohibitive mood, which has one of the three allomorphs -śig, -cig, and -źig. Our corpus yields the unexpected śig-pa, which we have also tagged with this tag.
The semi-final converb is the morpheme that has the three allomorphs ཞིང་, ཅིང་, and ཤིང་.
The locative converb is the clitic morpheme -na when suffixed to a verbs, or when suffixed to a verbal noun ending in -pa or -ba in the meaning 'if/when'.
Odd converbs are any that we discovered after fixing the tagset. So far we the following words with this tag: ལས་ཆེ, ཨང་, ཚད་.
The question converb is any allomorph of the marker -ḥam, which is an explicit marker of a polar question. Words tagged with this tag include: འམ, སམ, ནམ, དམ, གམ, ངམ, ཏམ, རམ, ལམ, བམ, མམ, ངེ་, and མཾ་
When the morpheme ruṅ appears after a verb meaning 'however' it is tagged with this tag.
The semi-final converb is the morpheme that has the three allomorphs ཏེ, སྟེ, and དེ.
The terminative converb is any of the allomorphs -tu, -du, -ru, -su, -r when suffixed to a verb.
This tag is used for the demonstratives ḥdi 'this' and de 'that'. These two words are tagged as demonstratives also when used as determiners (i.e. we do not distinguish rgyal-po de 'that king' from de 'that one, him'). Words tagged with this tag include: དེ་, འདི་, ཡ, ཚུ, མ, ཕ, ཕྱི་, ཕ་གི་, མ་ཀི, མ་གི, མ་གི་, མ་ཀི་, འདི་པ, ཡ་གི་.
The most frequent determiner is gźan 'other'. In addition, we identify ya-re 'each one (of two)' as a determiner on the basis of the following sentence: Brgya-byin daṅ Tshaṅs-paḥi rgyal-pos lag-pa ya-re nas zin te 'The kings Indra and Brahma each took him by one of his hands'. We reckon ḥbaḥ 'sole' as a determiner on the basis of sentences such as rus-pa daṅ khrag ḥbaḥ źig gis sa rtsog-rtsog ltar ḥdug-pa mthoṅ 'They saw the ground besmirched with only bone and blood'. Words tagged with this tag include: གཞན་, སོགས་, རང་, ཀ་, འབའ་, རེ་, རེ་རེ་, ག་, ཤ་སྟག་, ཁ་, གཞིན་, ཡ་རེ་, ཤས་དག་, སྣ་རེ་. We must check gzhin.
We initially invented this category for ñid in phrases such as rgyal-poñid 'that very king' or lusñid 'this body'. This syntactic use of ñid must be distinguished form its use in Buddhist terminology -ñid inside of words, e.g. stoṅ-pa-ñid 'emptiness'.Apart from ñid, we have categorized kho-na 'the very, same' and re-re 'each' as emphatics. This use of kho-na should not be confused with its function as a third person pronoun in Old Tibetan. In one case kho appears not as a personal pronoun but as what seems to be a variant of kho-na; this kho we also classify as an emphatic, viz. smras-paḥi tshig ḥdi bden na bden-paḥi tshig bden-paḥi tshig smras-pas / bdag gi lus ḥdi sṅa-ma kho bźin du rma med-par gyur cig 'If these words that I have said are true, then because of saying true words, let this my body be without wounds like before' (vol 74, page 137b). Words tagged with this tag include: ཉིད་, རབ་, ཁོ་ན་, and ཁོ་.
This category is used for the allomorphs of the indefinite marker cig, źig, and śig as in pho-ña cig 'a messenger'. The indefinite marker, which occurs inside of noun phrases, must be distinguished from the identically looking imperative converb (see below), which occurs suffixed to the imperative stems of verbs.
The plural markers rnams, dag, kun, thams-cad, ḥo-cog (and its variants) and tsho are tagged as their own category 'plural'. However, plural pronouns (bdag-cag, khyed-cag, ḥu-bu-cag) are treated as one word. The plural marker -cag is not removed because to do so would result in pronominal stems which are not mutually comparable (viz. bdag is a singular pronoun, khyed a plural pronoun, and ḥu-bu has no independent life outside of ḥu-bu-cag). We also tag ḥgaḥ 'some' as a plural, although in the abstract one would perhaps prefer to call it a 'quantifier'. The three verbs (la) sogs-pa 'etc', (daṅ) ldan-pa 'having', (daṅ) bcas-pa 'together with' could be seen as similar to quantifiers or otherwise to be treated as parts of the noun phrase, however, we have chosen to treat them etymologically as verbs. Words tagged with this tag include: རྣམས་, དག་, ཐམས་ཅད་, ཀུན་, ཡོངས་, སྣ་ཚོགས་, འགའ་, སྤྱི་, ཅུང་ཟད་, ཚོ་, དུ་མ, ཤས་, མཐའ་དག་, ཆུང་ཟད་, དགུ་ཅོག་, འགའ་རྫི་ར་, སོ་ཅོག་, སྣ་དགུ་. The inclusion of འགའ་རྫི་ར་ is probably a mistake.
In an earlier version of the tag set we had tagged tsam as [cl.tsam], but then it became clear that it only ever followed nominals, so it clearly deserves the [d.xxx] category. Words tagged with this tag include ཙམ་, སྙེད་, ཙམ་པ་, རྙེད་ (a misspelling for སྙེད་), ཙམ་པ, སྙད་, and སྙེད་པ.
This tag is used for interjections. Words tagged with this tag include: ཀྱེ་མ་, ཀྱེ་, ཨ་ལ་ལ་, ཀྱི་ཧུད་, ཁོ་དེ, མ་ལ་, འོ་དོད་, ཧེ་, ཨེ་མ་, and ཨེ་ཨེ་.
To identify lexical nouns we rely on the syntactic ability of the word in question to head a noun phrase, the dictionary meaning, and (when possible) the presence of nominal suffixes such as -mo, -po and -bu. Because we treat grammatical affixes as separate words, a single word normally does not not include grammatical affixes such as case markers and converbs. Nonetheless, there are well-motivated exceptions to this policy like gaṅ-na-ba 'whereabouts' and bdag-gi-ba 'that which is mine'. (We treat a -pa or -ba as part of the preceding word, regardless of the part-of-speech of the preceding word.) Another frequent category of exceptions is calques of Sanskrit terms, e.g. kun-tu-rgyu 'parivrājaka', which must be treated as a noun together because it looks like a verb phrase, but functions syntactically as a noun. In general, when two nouns occur in succession they are understood as a compound; the dandva compound pha-ma 'parents' is treated as one noun rather than two (pha 'father' and ma 'mother') and the tatpuruṣa compound khyim-bdag 'householder' is likewise treated as one noun rather than two (khyim 'home' and bdag 'lord'). When an adjective precedes its head this is also treated as a compound. Thus, because dug btsan-po would be the expected order for 'mighty poison', we treat btsan-dug 'mighty poison' (vol. 74, page 147a) as a single word. Apposition is the one category of exceptions when the concatenation of two nouns is not treated as a compound. An example of this type is the two words bu khyeḥu in the following sentence: deḥi tshe yul de na khyim-bdag cig la bu khyeḥu źig btsas na / 'At that time, in that land, when a child, a son, was born to a householder'. Rather than understanding khyeḥu as an adjective modifying bu, or taking bu-khyeḥu as one word, the second word simply sits after the first one to add greater specificity. The reason to not treat apposition as compounding is because apposition occurs with proper nouns. For example, in the sentence deḥi tshe na rgyal-po Gsal-rgyal gyi btsun-mo chen-po Ḥbar-li źes bya-ba la bu-mo źig btsas-nas 'At that time a daughter was born to Ḥbar-li, the main queen of king Prasenajit', rgyal-po 'king' and Gsal-rgyal 'Prasenajit' are in apposition; to unite the two as a compound would lead to unintuitive, unwieldy, and therefore unacceptable consequences.
We divide out mass nouns from normal lexical nouns on the basis of two instances in our corpus where otherwise two nouns not in apposition would follow each other: nor-bu sbar gaṅ 'a handful of jewels' and chu sñim-pa gaṅ 'a handful of water'. Knowing that there exists this syntactic difference between normal lexical nouns and mass nouns, we tag all plausible mass nouns on the basis of their meaning (e.g. zaṅs 'copper'). A final list of mass nouns can only be securely put forward after the syntactic behavior of these words is better investigated. Words tagged with this tag include: གསེར་, ཆུ་, རིན་པོ་ཆེ་, ནོར་, ཤ་, ཆུ, ས་, ཁྲག་, ཆང་, སྨན་, སེར་བ་, ནོར་བུ་, མར་, ངུར་སྨྲིག་, དངུལ་, ཡི་གེ་, ནས་, ལྕགས་, སྤུ་, གངས་, ཙན་དན་, གསེར་གཡུ་, གོས་, དར་, ཤེལ་, གསེར་དངུལ་, པདྨ་རཱ་ག་, མར་ཁུ, ཟངས་, ཡིག་ (I would think that this should only occur in compounds), གོས་དར་, ཆུ་སྡོར་, ཐེར་ཕྲུག་, བཟའ་, བསུ་ཆང་, བུམ་ཆུ, བཻདཱུརྱ, མཐོ་ཁྱད་, རིན་ཆེན་, སྣམ་བུ་, ཁ་བ་, གསེར་གྱི་མེ་ཏོག་ (a mistake), ཆུ་མེ་ཤིང་, ཉ་ཕྱིས་, ཐལ་བ་, ཐུད་, དབུས་ཕྲུག་, བུར་, མཐོན་མཐིང་, མར་གད་, མར་སར་, ཚོས་, ཟ་མ་བཅུད་, འཇིམ་པ་, འདག་, རིན་ཆེན་གསེར་, རྒྱགས་ཆུ་, རྡུལ་ཚོན་, རྣག་ཁྲག་, ལ་ཅ, ལ་ཆ, ལེ་བརྒན་, ཤ་ཁྲག་, སེང་ལྟེང་, སེང་ལྡེང་, སེར་, སྤེན་, and སྤེན་བད་.
This tag is used for personal and place names.
A relator noun is a noun, normally one syllable, which has a genitive before it and a spatial case (allative, locative, terminative) after it, e.g. deḥi naṅ na 'inside of that', deḥi druṅ du 'before him', deḥi ḥog tu 'under that', deḥi tshe na 'at that time'; relator nouns are not quantified and are not suffixed with adjectives, determiners or demonstratives. After identifying the class of relator nouns, these words are tagged as relator nouns even in syntactic contexts missing the genitive to left or the spatial case to the right. For example, in the sentence deḥi tshe blon-po źig phyi-rol nas naṅ du ḥoṅs-pa las/ mi btson du bzuṅ-ba mthoṅ-ba daṅ / 'Then the minister went inside from outside and saw the man who had been taken to prison', the relator noun tshe is not followed by a spatial case and the relator noun naṅ is not preceded by a genitive. In the phrase bar ḥgaḥ 'sometimes' we do not consider bar a relator noun because it undergoes quantification, e.g. … bar ḥgaḥ ni gti-mug gi phyir lus btaṅ yaṅ chos kyi phyir bsod-nams kyi źiṅ daṅ lan ḥgaḥ yaṅ ma phrad-paḥi lus ḥdi ci ruṅ ? 'What is the use of this body which … sometimes has been used because of ignorance, but has not yet met an occasion (to serve) as a field of merit'. Words tagged with this tag include: ལྟ, ཕྱི, བཞིན་, ཚེ་, སྐད་, བར་, དུས་, ནང་, རྗེས་, དྲུང་, ལྟ་བུ, འདྲ་, འོག་, སྟེང་, ལྟ་བུ་, སླད་, ངང་, ཕྱོགས་, མཐའ་, སྒོ་, བཞིན, ས, སླ, མདུན, དབུས་, ཐོག་, ཁོང་, ཕྱི་བཞིན་, ཁ, ཆེད་, ཐོག་མ, སྐབས་, གོང་, རྟིང་, ཐད་, གན་, མཇུག་, འཕྲལ་, ལོགས་, ཁྲོད་, རྟེན་, སྨད་, ཁོངས་, གསེབ་, ཐ་མ, དུས, ཕུ་, དྲུད་, བསེབ་, མགོ, ཙ་, རིང་, རྩ, ལྟག་, སྐོར་, དཀྱིལ་, དབུང་, བྱང་, མཐའ་ལོགས་, ཚུན་, ཞོར་, འཁྲིས་, འོག་བ, རྒྱབ་, སྐེ་, སྐེད་, སྙིང་ཁ, སྟེངས་, སྟོད་. Several questions of delimiting relator nouns require further consideration, in particular, the approach to lta versus lta-bu.
This tag is used for the nominalized version of verbs that would be tagged [v.aux]. Forms currently tagged with this tag include དགོས་པ་, ནུས་པ་, ཤེས་པ་, དཀའ་བ་, ཕོད་པ་, ཐག་པ་, and རན་པ་.
This tag is used for nominalized versions of verbs that would be tagged [v.cop], q.v.
This tag is used for the nominalized versions of future verb stems, i.e. a future verb stem followed by pa, ba, tshul, rgyu, mkhan, mi, or bya.
This tag is used for nominalized forms of verb stems that would be tagged [v.fut.v.past], q.v.
This tag is used for the nominalized version of a verb stem that would be tagged [v.fut.v.pres], q.v.
The imperative stem of a verb is supposed not to be nominalized. However, in rare cases there are nominalized forms which appear to come from imperative stems, and we mark them accordingly. The form ཐོངས་པ་ is the only case that has come up so far.
This tag is used for nominalized forms of verb stems that would be tagged [v.invar] q.v.
This tag is used for the nominalized forms of the inherently negated verb med, such as med-pa and med-sa.
This tag is used for the nominalized versions of past verb stems, i.e. a past verb stem followed by pa, ba, tshul, rgyu, mkhan, or mi.
This tag is used for the nominalized version of a verb stem that would be tagged [v.past.v.pres], q.v.
This tag is used for the nominalized versions of present verb stems, i.e. a present verb stem followed by pa, ba, tshul, rgyu, mkhan, or mi.
The two negation prefixes ma and mi are classified together in their own category. The polar question prefix e is also categorized under this tag because it occupies the same syntactic position as ma and mi. For the two verbs min and med negation is inherent to their meaning, consequently 'neg' is also added to their POS-tags (i.e. min|v.cop.neg and med|v.neg see above).
In numbers we distinguish cardinals (gcig, gñis, gsum, etc.) and ordinals (daṅ-po, gñis-pa, gsum-pa, etc.). Other derivatives of numerals are treated according to their respective syntax, thus gcig-pa 'sole' is an adjective, gñi-ga 'both' is an indefinite pronoun, etc. In higher numbers each component digit is tagged separately, to do otherwise would prevent the computer from learning pattens by virtue of having to independently learn each possible cardinal number of the infinite possibility. When a numeral follows a noun we regard the two as separate words. In addition to obvious cases like mi lṅa 'five men', we also treat dkon-mchog gsum 'triratna' as two words. While it is true that one will almost never encounter any other numeral after the word dkon-mchog this fact says as much about Buddhism as it does about syntax. The treatment of phrag well exemplifies our pragmatic attitude toward part-of-speech tagging. Although not a cardinal number itself, this syllable occurs inside cardinal numbers, effectively marking a certain place with a zero, e.g. stoṅ phrag drug cu '1060'. Because the internal structure of numerals is not of interest to our project and adding a new tag for phrag would add unnecessary complications to our tag-set, we treat phrag itself as a cardinal number. There are occasions when the morphology of a word suggests that it might contain a numeral (e.g. mṅon-sum 'real', phun-sum-tshogs 'marvelous'), but there is no reason to see such cases as synchronically analyzable. Numbers that end with -po are also tagged [num.card].
In numbers we distinguish cardinals (gcig, gñis, gsum, etc.) and ordinals (daṅ-po, gñis-pa, gsum-pa, etc.). Other derivatives of numerals are treated according to their respective syntax, thus gcig-pa 'sole' is an adjective, gñi-ga 'both' is an indefinite pronoun, etc. In higher numbers each component digit is tagged separately, to do otherwise would prevent the computer from learning pattens by virtue of having to independently learn each possible cardinal number of the infinite possibility. Some adjectives can be distinguished from ordinal numbers only in context. For example, in the phrase rdo-rje rtse-lṅa-pa 'a five-pronged vajra' the noun phrase syntax and the overall meaning of the passage dictates that lṅa-pa is part of the word rtse-lṅa-pa 'five-pronged' rather than a word 'fifth'.
The words la-la 'some', so-so 'each', and gñi-ga 'both' are used as indefinite pronouns in our pilot corpus. Words tagged with this tag include: ཁ་, སོ་སོ་, ལ་ལ་, རེ་རེ་, རེ་. (It looks like we are no longer putting gnyi-ga in here. I wonder why?)
This is the tag used for interrogative pronouns such as su 'who', nam 'when', and gaṅ 'where'. Words tagged with this tag include: ཅི་, གང་, ཇི་, ཅི, སུ་, ནམ་, ཇི་སྲིད་, ག་, ག་རེ་, ཇི་ཙམ་, ཅི་སྲིད་, and ཙུག་.
This tag is used for personal pronouns. Words tagged with this tag include: བདག་, ཁྱོད་, ང་, ཁྱེད་, བདག་ཅག་, ངེད་, ཁོ་, ཁོང་, རང་རེ་, ཁོ་བོ་, རང་, མོ་, ཁོ་མོ་, ཁྱེད་ཅག་, འོ་སྐོལ་, འུ་ཅག་, འོ་ཅག་. The word རང་ is only tagged as [p.pers] when it functions as a full independent pronoun in its own right; in ང་རང་ and ཁྱེད་རང་ the word རང་ is tagged [p.pers]. The word བདག་ is tagged as [p.pers] when it is a humble equivalent of ང་. When this word means 'lord' or ' the self' (philosophy) it is tagged [n.count] and when it means 'himself' it is tagged as [p.refl].
This tag is used for reflexive pronouns. The only word tagged with this tag རང་ and བདག་. The word རང་ is tagged as [p.refl] when it functions as a reflexive, in a sentence (invented) such as ཁོས་ དེ་རིང་ རང་ གི་ ཟན་ ཟོས་ སོ་ ' Today, he ate his own food' or in long form personal pronouns such as ང་རང་ and ཁྱེད་རང་. When རང་ is full independent pronoun meaning 'I' or ' you' it is tagged [p.pers]. The word བདག་ is tagged as [p.pers] when it is a humble equivalent of ང་; when it means 'lord' or ' the self' (philosophy) it is tagged [n.count] and when it means 'himself' it is tagged as [p.refl].
This tag is used for the words རང་ and བདག་ when they are used as reflexive pronouns, i.e. in the meaning 'himself, herself'. The word རང་ when it immediately appears after a personal pronoun is always tagged [p.refl].
While a tsheg is considered part of the preceding syllable, all other punctuation marks are tagged as punctuation. Such marks include །, ༑, །།, ༄༅༅།, and །།།།. Punctuation marks are broken apart to aid tokenization, e.g. we tag ༄|punc ༅|punc ༅|punc །|punc.
This tag was invented for the odd Sanskrit word that occurs in Tibetan (e.g. in mantras) not as a syntactic word in its own right, but as something metalinguisitc at least vis à vis Tibetan. On analogy to this use of Sanskrit words in Tibetan we have extended the use of the tag to include all metalinguistic forms, e.g. words used not as the words they are but as the name of that word, or the name of a letter, words in Roman script, etc.
This tag is used for auxilliary verbs. By this we understand only thos verbs that show no morphological stem variation and may occur directly after another verb stem. The words we tag with this tag are: ཐག་, དགོས་, ནུས་, མོད་, འདོད་, ཤེས་, སྲིད་, ཤེས་, རན་, གྲགས་, ཐང་, and ཕོད་. Note that this set is far fewer than the total number of verbs that might be considered auxiliary verb in classical or modern Tibetan. In fact, we now feel that it was a mistake the give these tags a separate encoding, instead they should have been tagged on the basis of their syntax (e.g. mi [neg] nus [v.pres.v.fut]) and the question of auxiliation and other syntactic uses of verb stems in Tibetan addresssed systematically in a future project. However, once we arrived at this understanding it was inconvenient to change our approach.
This tag is used for copula verbs, in particular yin and lags which do not distinguish various stems. In our current view, it would have been better not to have included this tag, but rather to have tagged these verbs according to their syntactic position, e.g. ma [neg] yin [v.past], yin [v.past.v.pres] nas [cv.ela], etc. However, when we made this determination it was already inconvenient to go back on the inclusion of the tag [v.cop].
This tag is used for the inherently negated copula verb min.
This tag is used for future stem verbs, e.g. bya, gzuṅ, sgrubs.
This tag is used for verb stems that are both morphologically and syntactically ambiguous between the past and futur, e.g. bskol.
This tag is used for verb stems that are both morphologically and syntactically ambiguous between the past and present, e.g. gśegs in the phrase mi gśegs.
This tag is used for imperative verb stems, e.g. byos, zuṅ, sgrubs.
This tag is used for verb stems that are morphologically and syntactically ambiguous among present, past, and future, e.g. gśegs in the phrase gśegs so.
This tag is used for the inherently negated verb med.
This tag is used for past stem verbs, e.g. byas, bzuṅ, bsgrubs.
A verb stem that is both morphologically and syntactically ambiguous between the past and present, e.g. gśegs in the prhase gśegs nas.
This tag is used for present stem verbs, e.g. byed, ḥdzin, sgrub.