About the project
Work on the Corpus of Tatar texts began in 2010. The project grew out of the authors' discussions about two directions of research:
From the relevant literature we learned that modern systems for machine translation (MT) and automatic speech recognition rely on national corpora of the languages in question, applying the “hypothesis-check” method. This prompted us to commit to creating a similar corpus for the Tatar language.
The Corpus of Written Tatar is based mainly on materials available on the web. By following the web addresses given after the example sentences in the search results, the user can learn more about the sites used to create the corpus.
Texts from different sources were processed before being included in the Corpus of Written Tatar: HTML tags were deleted, sentences in foreign languages were removed, the text encoding was converted to UTF-8, sentence boundaries were marked automatically, etc.
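The processing steps listed above can be illustrated with a minimal Python sketch. It is not the project's actual pipeline: the function names, the order of the steps, the regex-based sentence splitter, and the 50% Cyrillic-letter threshold are all assumptions made for illustration. In particular, a script check only filters out text in other alphabets; it cannot tell Tatar from Russian, which would require a proper language identifier.

```python
import html
import re

# All names and heuristics below are illustrative assumptions,
# not the corpus authors' implementation.
TAG_RE = re.compile(r"<[^>]+>")
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")
CYRILLIC_RE = re.compile(r"[\u0400-\u04FF]")


def to_text(raw: bytes, source_encoding: str) -> str:
    """Decode raw bytes from a known source encoding into a string."""
    return raw.decode(source_encoding, errors="replace")


def strip_html(text: str) -> str:
    """Replace HTML tags with spaces and unescape entities such as &amp;."""
    return html.unescape(TAG_RE.sub(" ", text))


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    normalized = " ".join(text.split())
    return [s for s in SENTENCE_END_RE.split(normalized) if s]


def looks_cyrillic(sentence: str, threshold: float = 0.5) -> bool:
    """Keep a sentence if at least `threshold` of its letters are Cyrillic."""
    letters = [ch for ch in sentence if ch.isalpha()]
    if not letters:
        return False
    cyrillic = sum(1 for ch in letters if CYRILLIC_RE.match(ch))
    return cyrillic / len(letters) >= threshold


def preprocess(raw: bytes, source_encoding: str) -> list[str]:
    """Decode, strip tags, split into sentences, drop non-Cyrillic ones."""
    text = strip_html(to_text(raw, source_encoding))
    return [s for s in split_sentences(text) if looks_cyrillic(s)]


def save_utf8(sentences: list[str], path: str) -> None:
    """Write one sentence per line, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as out:
        out.write("\n".join(sentences) + "\n")
```

For example, calling `preprocess` on a page that mixes Tatar and English sentences would keep only the Cyrillic-script sentences, and `save_utf8` would store them one per line.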
The collection and processing of materials is ongoing. Having learned about the existence of the Corpus of Written Tatar, many writers and scholars have provided us with electronic versions of their books and articles. Our practice is to update the published version of the Tatar corpus whenever newly acquired contributions reach 5-6 million word occurrences. The user interface is updated at the same time.
The Corpus of Written Tatar can also be regarded as an enormous reference book, giving the user an orderly view into the world of the Tatar language.
Using the Corpus of Written Tatar is free of charge.
To represent the Tatar language adequately and to be called the national corpus of the Tatar language, a corpus should contain no fewer than 100 million word occurrences. We reached this figure in 2014.
The Corpus of Written Tatar was created by the following people: