Tibetan in Digital Communication is a research project funded by the Arts and Humanities Research Council, engaged in building a 1,000,000-syllable part-of-speech tagged corpus of Tibetan texts spanning the language's entire history. In addition to the corpus, the project is developing a number of digital tools that allow the corpus to be employed in many areas of humanities research, and enable other researchers to more easily develop their own corpora or software tools.
The corpus will be a powerful resource for scholars working with Tibetan language materials in a wide range of disciplines —including history, religion, literature and linguistics—since it will offer ready access to, and comparison across, texts from different time periods, regions and genres. It will also provide an important foundation for subsequent work on a historically comprehensive, lexicographically rigorous dictionary of Tibetan.
Building this corpus reduces the cost of developing language technologies for Tibetan, such as text messaging, spellcheckers and machine-aided translation. These technologies would give Tibetans the choice to use their language as they see fit in a world that is increasingly shaped by digital communication.
At present, the corpus consists of three distinct collections. First, there is the Classical corpus.
Second, there is the Saint Petersburg corpus, consisting of texts assembled and tagged by Pavel Grokhovsky of Saint-Petersburg State University. We are re-tagging these texts in order to make them consistent with our scheme.
Third, there is the Berlin corpus, kindly provided to us by Michael Balk. This corpus comprises the entire Tibetan catalogue of the Berlin State Library. It came to us fully segmented, but not tagged. We are now tagging the corpus after having converted it from Extended Wylie to Unicode.
Together, these three collections constitute the "Complete corpus".
Texts are available in horizontal and vertical formats. In horizontal format, a single space marks the boundary between words, and line breaks separate sentences. Each word consists of a word form followed by a tag, with the pipe character in between. Whitespace is not permitted within words. For example:
མི་|neg བཤིག་|v.past ན་|cv.loc བསྲུངས་པ|n.v.past ས་|case.agn ཆོག|v.invar བཤིག་|v.past མི་|neg བཤིག་|v.past ལྟ|cl.lta འོ་|cv.fin ཟེར་བ་|n.v.fut.n.v.pres ལ|case.all །|punc མ་|neg བཤིག་པ|n.v.past ར་|case.term མཁར་ལས་|n.count ནར་མ|adj ར་|case.term བྱས་པ|n.v.past ས|case.agn །|punc
Each line of a horizontal file corresponds to a page of text. We have made no attempt to ensure that page breaks correspond to sentence breaks; therefore, logical sentences are often split over two lines. However, we do not allow page breaks within words; where a page break fell inside a word, we have moved it to the nearest word boundary.
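As a sketch of how such a line might be consumed programmatically (the function name and code are ours, not part of the project's tools):

```python
# Parse one line of horizontal format: words are separated by single
# spaces, and each word is a "form|tag" pair joined by the pipe character.
def parse_horizontal_line(line):
    pairs = []
    for token in line.strip().split(" "):
        form, _, tag = token.partition("|")
        pairs.append((form, tag))
    return pairs

# For example, the first two words of the sample line above:
parse_horizontal_line("མི་|neg བཤིག་|v.past")
# → [("མི་", "neg"), ("བཤིག་", "v.past")]
```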
In vertical format, each word occurs on a separate line, with sentence breaks indicated either by a blank line or by special word forms. Word forms are separated from their tags by a tab; word forms are therefore permitted to contain the single space character. The part-of-speech tag is usually followed by a lemma to which the word form belongs when it has that tag. Since we are not tracking lemmas, we follow convention and write a dash in the lemma field. Here are the same two sentences in vertical format:
མི་	neg	-
བཤིག་	v.past	-
ན་	cv.loc	-
བསྲུངས་པ	n.v.past	-
ས་	case.agn	-
ཆོག	v.invar	-
བཤིག་	v.past	-
མི་	neg	-
བཤིག་	v.past	-
ལྟ	cl.lta	-
འོ་	cv.fin	-
ཟེར་བ་	n.v.fut.n.v.pres	-
ལ	case.all	-
།	punc	-

མ་	neg	-
བཤིག་པ	n.v.past	-
ར་	case.term	-
མཁར་ལས་	n.count	-
ནར་མ	adj	-
ར་	case.term	-
བྱས་པ	n.v.past	-
ས	case.agn	-
།	punc	-
Long sentences can cause problems for some tools that use vertical format. For this reason, we delimit vertical files differently than horizontal files. Rather than inserting sentence breaks at page breaks, we instead insert sentence breaks after sequences of punctuation.
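A minimal converter from horizontal to vertical format, under the conventions just described (tab-separated fields, a dash for the untracked lemma, and a sentence break after each run of punctuation), might look like the sketch below. It is an illustration of the formats, not the project's own tool:

```python
# Convert horizontal format ("form|tag" tokens separated by spaces) to
# vertical format (one "form<TAB>tag<TAB>-" line per word), inserting a
# blank line as a sentence break after each run of punctuation tokens.
def horizontal_to_vertical(line):
    out = []
    tokens = line.strip().split(" ")
    for i, token in enumerate(tokens):
        form, _, tag = token.partition("|")
        out.append(f"{form}\t{tag}\t-")
        nxt = tokens[i + 1].partition("|")[2] if i + 1 < len(tokens) else None
        if tag == "punc" and nxt != "punc":
            out.append("")  # sentence break after a punctuation sequence
    return "\n".join(out)
```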
Any use of these downloaded files must comply with the terms of the CC BY 4.0 license.
Numerous Tibetan lexicons are available for download from this site. First, we make use of a processed and somewhat modified version of Nathan Hill's A Lexicon of Tibetan Verb Stems as Reported by the Grammatical Tradition. Using this lexicon, we have been able to pre-tag many verb stems and verbal nouns whose tags would otherwise not be known.
In addition to the verb lexicon, we also generate mini-lexicons for each text or text collection that is being tagged. Finally, the complete lexicon combines the lexicon drawn from the main corpus with the verb lexicon.
Lexicons are stored and distributed in vertical format. Each word form has its own line, with tabs separating possible readings. Each reading has two parts: a part-of-speech tag, and a lemma to which the word form belongs when it has that tag. Since we are not tracking lemmas, the lemma field is always left empty; by convention, this is indicated with a dash.
ཐོ་ལེ་བ་	adj	-
ཐོག	v.fut	-	v.imp	-
ཐོག་	n.count	-	n.rel	-	v.fut	-	v.imp	-
Note that our system treats word forms with and without tsheg (e.g. ཐོག་ and ཐོག) as separate lexical entries, because the two forms may have different distributions.
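Reading a lexicon line back into a usable structure is straightforward; this sketch (our own, with a hypothetical function name) returns the word form and its list of (tag, lemma) readings:

```python
# Parse a lexicon line: the word form, then one or more (tag, lemma)
# pairs, all tab-separated. Lemmas are untracked, so each lemma field
# is just a dash.
def parse_lexicon_line(line):
    fields = line.rstrip("\n").split("\t")
    form, rest = fields[0], fields[1:]
    readings = [(rest[i], rest[i + 1]) for i in range(0, len(rest), 2)]
    return form, readings

form, readings = parse_lexicon_line("ཐོག\tv.fut\t-\tv.imp\t-")
# form is "ཐོག"; readings are [("v.fut", "-"), ("v.imp", "-")]
```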
Lexicons are updated nightly, and made available for download through the above links. Any use of these files must comply with the terms of the CC BY 4.0 license.
The website's search functionality is currently limited to exact match searching for Tibetan words. If you enter a Tibetan word, then the system will find all occurrences of the word, allowing you to further narrow your search by part-of-speech if the word form is ambiguous. For example, try typing ཐོག་ into the search box.
A second kind of searching helps to find interesting patterns in pos taggings. In the "shingle search" interface, whole corpora are tagged from scratch using our current best segmenter followed by the rule tagger. These search pages use a Flash plugin, ZeroClipboard, to copy the shingle tables to the clipboard, and to export them to CSV, Excel, or PDF formats. These functions will not work on mobile platforms or in browsers that lack Flash.
Here are some sample searches to explain how shingle searching works.
Exclude ambiguous tags
As noted above, shingle search uses our current best segmenter followed by the rule tagger. When the rule tagger isn't sure what the tag for a word should be, it gives an ambiguous answer. Setting this option excludes ambiguous answers from the returned results.
Require ambiguous tags
For other purposes, it may be useful to only return results that contain ambiguous answers. For example, when devising new rules, it can be helpful to target frequent ambiguities.
Show word forms
Normally, shingle search returns pos tags only. By selecting this option, word forms are returned instead.
+ partial matches
Preceding a pos item with "+" indicates that only a partial match is required. [+v] will match every tag that starts with "v", and [+v,n.v] will match every tag that starts with either "v" or "n.v". Note that both [+v,+n.v] and [v,n.v] are ill-formed searches. With this in mind, consider the following query:
[v.fut] [cv.term] [+v,n.v]
With shingle size set to 4, the following query finds sequences of four pos tags, starting with the head of a noun phrase and continuing with additional tags that could be part of a noun phrase.
[n.count] [+num,d,n.,adj] [+num,d,n.,adj] [+num,d,n.,adj]
- partial excludes
If a pos item is preceded by "-" then words with that pos tag should not match. The following query resembles one above, except that the first item matches anything except tags that begin with "v.fut" or "v.past".
[-v.fut,v.past] [cv.term] [+v,n.v]
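The matching semantics of the three item types can be summarised in a few lines of code. This is our reading of the conventions above, not the site's implementation:

```python
# Match one query item against a pos tag, following the conventions above:
#   [tag]        exact match on a single tag
#   [+a,b,...]   match tags starting with any listed prefix
#   [-a,b,...]   match tags NOT starting with any listed prefix
def item_matches(item, tag):
    body = item.strip("[]")
    if body.startswith("+"):
        return any(tag.startswith(p) for p in body[1:].split(","))
    if body.startswith("-"):
        return not any(tag.startswith(p) for p in body[1:].split(","))
    return tag == body

def query_matches(query, tags):
    items = query.split()
    return len(items) == len(tags) and all(
        item_matches(i, t) for i, t in zip(items, tags))

query_matches("[v.fut] [cv.term] [+v,n.v]", ["v.fut", "cv.term", "n.v.past"])
# → True
```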
One approach recasts word segmentation as a problem of syllable classification. Each syllable in a word is tagged in one of 8 ways. The lone syllable of a single-syllable word is tagged "S". In multisyllabic words, the initial syllable is tagged "X", and the final syllable is tagged "E". Other syllables are tagged with Y, Z, or M, as follows: X-Y-E, X-Y-Z-E, X-Y-Z-M-E, X-Y-Z-M-M-E, and so on. Two additional tags are used for Tibetan's complex syllables. The tag "SS" is used for two single-syllable words that are joined together, such as འདིའི་. The tag "ES" is used for a word-final syllable that is joined together with the following single-syllable word, such as the པོས་ in ཆེན་པོས་. Note that for the purposes of word segmentation, these complex syllables are split into separate words: འདིའི་|SS becomes འདི + འི་, and ཆེན་|X པོས་|ES becomes ཆེན་པོ + ས་.
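To make the scheme concrete, here is a sketch (ours, not the project's code) that reconstructs words from a sequence of tagged syllables. It covers the simple tags only; SS and ES additionally require a split inside the syllable itself, which needs morphological knowledge and is omitted here:

```python
# Reconstruct words from syllables labelled with the segmentation tags
# described above. Handles the simple tags (S, X, Y, Z, M, E) only:
# a word ends at each syllable tagged "S" or "E".
def syllables_to_words(tagged):
    words, current = [], []
    for syllable, tag in tagged:
        current.append(syllable)
        if tag in ("S", "E"):
            words.append("".join(current))
            current = []
    return words

syllables_to_words([("ཆེན་", "X"), ("པོ་", "E"), ("མི་", "S")])
# → ["ཆེན་པོ་", "མི་"]
```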
In some cases there are two ways in Unicode to encode a Tibetan character. In order to simplify the statistical models we normalize to the more common encoding. When uploading new texts it is useful to ensure that the following changes have been made:
- "༎" changed to "།།"
- "༌" (typically found after ང and before a śad) changed to "་"
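Both changes are simple find-and-replace substitutions, as in this sketch (using the codepoints U+0F0E double śad, U+0F0D śad, U+0F0C non-breaking tsheg, and U+0F0B tsheg for the characters above):

```python
# Normalize the two alternative encodings noted above before uploading:
# U+0F0E (double shad) becomes two U+0F0D (shad), and U+0F0C
# (non-breaking tsheg) becomes U+0F0B (ordinary tsheg).
def normalize(text):
    return text.replace("\u0F0E", "\u0F0D\u0F0D").replace("\u0F0C", "\u0F0B")
```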
To see a list of tags used by the various corpora, click on the links to the right. The Classical corpus is the authoritative reference for our tagset, so should be the first point of reference. As noted above, the Saint Petersburg corpus comes via Pavel Grokhovsky, who used a different tagset. We have been converting his tagset to our system, but this work is not yet complete. Finally, the Berlin corpus is only partially tagged, and so includes a great many words with dummy tags (for example, xxx).
The aim of pre-tagging is to present the human tagger with a reduced set of choices when tagging a text. The pre-tagger may leave difficult tagging decisions to the human, but it should make every effort not to eliminate possible tags. Human taggers download the pre-tagged outputs, and then upload their corrections back to the system.
In this interface, pages that have not yet been hand-tagged are pre-tagged by applying the current best segmenter followed by the rule tagger. Pages that have already been hand-tagged are also pre-tagged, enabling the performance of the segmenter and rule tagger to be easily assessed alongside the correct tagging.
After hand-tagged texts are fed back into the system, they are checked using the tag suggestions mechanism. For each corpus, the system generates a list of tag suggestions, which draw attention to those cases where the rule tagger's answer differs from the human tagger's.
The purpose of this step is twofold. On the one hand, the machine's correct suggestions bring attention to mistakes or inconsistencies in the human tagging. On the other hand, the machine's incorrect suggestions point us to rules that need to be revised or removed.
The lex tagger takes a segmented text as input, and assigns to each word every possible part of speech tag it can have. Lex tagging outputs are used as inputs to the regex and cg taggers. They can also be used as baselines against which to compare the performance of other taggers.
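In essence, the lex tagger is a lexicon lookup. The sketch below (with a toy lexicon; neither is the project's actual code) assigns every known tag to each word, and a dummy tag to words missing from the lexicon:

```python
# A minimal lex tagger: look up each word of a segmented text in a
# lexicon and assign every tag it can have.
LEXICON = {
    "ཐོག": {"v.fut", "v.imp"},
    "ཐོག་": {"n.count", "n.rel", "v.fut", "v.imp"},
}

def lex_tag(words, lexicon):
    # Unknown words receive a dummy tag, mirroring the corpus convention.
    return [(w, sorted(lexicon.get(w, {"xxx"}))) for w in words]

lex_tag(["ཐོག"], LEXICON)
# → [("ཐོག", ["v.fut", "v.imp"])]
```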
Various lex tagged texts are available for download here, in horizontal and VISL CG format. VISL CG format is a kind of vertical file required as input to the CG3 tagger (see documentation).
If every word is in the lexicon, then the lex tagger will be 100% accurate when applied to an input text. However, the output of the lex tagger is still highly ambiguous, since many words have more than one tag. This is where the rule tagger comes in. Its job is to use contextual rules to eliminate impossible tags and thereby reduce ambiguity, while retaining near perfect accuracy. First, the lexicon is used to assign to each word of a text all of its possible tags. Then, the rules are applied in order, stripping off tags that are not possible given the surrounding context.
Each rule package includes a background explanation along with a concise statement of the rule itself, with the latter forming the basis of its implementation. End users wishing to understand the intended purpose and function of a rule can ignore the code and focus instead on the rule background and statement.
Regular Expressions
The rule tagger was first implemented using regular expressions. While useful for prototyping, this first tagger has proven brittle. It is difficult to maintain, and easy to corrupt. Moreover, regular expressions tend to exclude non-technical users, diminishing any realistic hope of getting others involved in the process of refining the tagger.
We continue to update the regular expressions tagger, but we anticipate that it will eventually be entirely superseded by the constraint grammar tagger. Constraint grammar was specifically designed for use by linguists and computer professionals for language analysis, with rules that are much easier to read and maintain.
Please note that the rules below and the taggers available for download from this site are the most up-to-date versions that we have. Occasionally, a rule modification will break the tagger. We usually fix broken rules quickly, so if the tagger doesn't work for you, check back later and download a new tagger.
The regex tagger consists of a sequence of rules, applied in order to a horizontal text. Each rule consists of two parts: the pattern (before the > symbol) and the replacement (after the > symbol).
PATTERN > REPLACE
The horizontal taggings available for download here are the result of applying the regex tagger to hand-segmented text. One use of these outputs might be as input to a statistical tagger charged with the task of further reducing pos tagging ambiguity.
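For illustration, a rule in this style can be emulated with Python's re module. The rule below is invented for this example (it is not one of the project's rules): after the negative མི་, it strips the imperative reading from an ambiguously tagged verb.

```python
import re

# PATTERN > REPLACE, expressed as a Python substitution. This toy rule
# reduces an ambiguous v.fut.v.imp tag to v.fut after མི་|neg.
PATTERN = r"(མི་\|neg \S+\|)v\.fut\.v\.imp"
REPLACE = r"\1v.fut"

re.sub(PATTERN, REPLACE, "མི་|neg བཤིག་|v.fut.v.imp")
# → "མི་|neg བཤིག་|v.fut"
```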
VISL CG is a C++ application that should be compiled and built for your specific platform. To install the software on your machine, follow these detailed instructions.
Next, download and compile the grammar using the cg-comp command.
cg-comp 2014-10-31-cg3-tagger.txt cg3-tagger.cg
Finally, use the vislcg3 command to apply the compiled tagger to a lex tagged output in VISL CG format. For example, the following command will apply the tagger to the lex tagged མཛངས་བླུན་ཞེས་བྱ་བའི་མདོ།, assuming the tagger, the input file, and the output file are all in your current working directory.
vislcg3 -g cg3-tagger.cg -I lex_vislcg_74.txt -O cg3-74.txt
The CG POS tagger is provided in two flavours. The word tagger takes as input a sequence of word cohorts. The syllable tagger takes as input a sequence of syllable cohorts.
Each corpus has a dynamically updated scoring page that measures performance along several dimensions. It is assumed that each token of a text has one and only one correct tag, but that a tagger will not always settle on a single tag for each token. Accuracy measures how often a tagger includes a token's correct tag as one of its possible tags. Ambiguity measures the average number of tags the tagger assigns to tokens. Finally, we can count the number of times a tagger is correct, that is, the number of times it assigns the correct tag to a token, without assigning any other possible tags.
Ideally, a tagger will be correct for every token. If so, it will score 1 (or 100%) for accuracy, and 1 (or 1 average tag per word) for ambiguity. Clearly, there is little value in a perfect score on one dimension, if the other score is poor.
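The three measures can be computed as follows; this sketch is our own rendering of the definitions above, with "correct" reported as a fraction of tokens rather than a raw count:

```python
# Score a tagger's output against a gold tagging. `predicted` holds, for
# each token, the set of tags the tagger left in place; `gold` holds the
# single correct tag per token.
def score(predicted, gold):
    n = len(gold)
    accuracy = sum(g in p for p, g in zip(predicted, gold)) / n
    ambiguity = sum(len(p) for p in predicted) / n
    correct = sum(p == {g} for p, g in zip(predicted, gold)) / n
    return accuracy, ambiguity, correct

acc, amb, cor = score([{"n.count"}, {"v.fut", "v.imp"}], ["n.count", "v.fut"])
# acc = 1.0 (both gold tags retained), amb = 1.5, cor = 0.5
```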
All software written as part of this project is being released under an open source license, and is available at Tibetan NLP, a GitHub page collecting tools and resources related to natural language processing of Tibetan. The page also has contributions from the Tibetan Buddhist Resource Center and the Tibetan & Himalayan Library.