BulTreeBank WordNet (BTB-WN)

BTB-WordNet Versions

The creation of BulTreeBank WordNet has a long history. It started as Bulgarian domain lexicons aligned to domain and upper ontologies, used within the following European projects: LT4EL and AsIsKnown. The utility of these lexicons for semantic annotation of domain texts and for some other NLP tasks motivated us to start our own work on a Bulgarian WordNet.

BTB-WordNet 4.0

When the CLaDA-BG project started at the end of 2018, BTB-WN contained a little more than 19 000 synsets. During the project (2019, 2020, 2021 and the beginning of 2022), BTB-WN was checked for consistency by two people, its definitions were improved, it was mapped to Bulgarian Wikipedia, and new examples were added to many synsets.

Thus, the current version of BTB-WN, 4.0, was thoroughly revised using specialized software and extended with more than 10 000 new senses, so it now contains more than 30 000 synsets.
Several explanatory dictionaries were consulted to determine the number of word senses in BTB-WN, the definitions, etc.

Parts-of-speech Representation

BTB-WN currently includes representations of the following four parts of speech: 

Nouns

Grammar

Nouns that are defective with respect to the number category, that is, used only in the singular (прах, “prah”, dust) or only in the plural (анали, “anali”, annals), always have only singular lemmas or only plural lemmas, respectively. The information about their usage is given in a lemma marker.

Adjectives

Grammar

This category also includes participles that function as adjectives, under one of two conditions: the participle either has its own entry in the dictionaries or is identified as a synonym of an adjective.

For example, the synset containing дебел, “debel”, fat includes several participles as synonyms: охранен, “ohranen”, and угоен, “ugoen”, from the verbs охраня, “ohranya”, and угоя, “ugoya” (make a person or animal gain weight through a rich diet), respectively, and хранен, “hranen”, from храня, “hranja”, feed.

Ordinal numerals (for example, трети, “treti”, third) are assigned to the adjective category in BTB-WN, as is done in the Open English Wordnet (OEW).

Adverbs

Grammar

Both types of Bulgarian adverbs are represented in BTB-WN: regular adverbs (derived from nouns, adjectives, numerals, verbs or prepositions, for example бързо, “bǎrzo”, quickly) and pronominal adverbs (derived from pronouns, for example тук, “tuk”, here).

Verbs

Grammar

Impersonal verbs have lemmas in the third person singular. For example, оказва се, “okazva se”, and окаже се, “okaže se” (turn out, prove, turn up), are given in the third person singular.

The addition of pronouns, prepositions, conjunctions, particles and interjections is being considered for the future.

WordNet Tree

Mappings

Mapping with Open English Wordnet (OEW)

BTB-WN was first mapped to Princeton WordNet 3.0 (PWN) and later also to the OEW.

The mapping process starts with the translation of a Bulgarian term into English, followed by a search for the corresponding English synset, the establishment of a relation between the two synsets, and the addition of a Bulgarian definition and examples.

Since 2020 BTB-WN has been mapped to the OEW. The main benefit of this mapping is that the OEW, unlike Princeton WordNet (PWN), continues to be updated, edited and expanded.
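The mapping steps described above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the lookup tables, the `SynsetLink` record, the `map_term` function, the relation label and the synset identifier are all hypothetical and are not taken from BTB-WN or its tooling. In practice the choice of the English synset is made by an annotator; the sketch merely collects candidates.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical toy data: a Bulgarian-to-English translation table and an
# index from English words to made-up synset identifiers.
TRANSLATIONS = {"прах": ["dust"]}
EN_SYNSETS = {"dust": ["toy-en-0001"]}


@dataclass
class SynsetLink:
    """One candidate link between a Bulgarian lemma and an English synset."""
    bg_lemma: str
    en_synset_id: str
    relation: str          # hypothetical relation label
    bg_definition: str
    bg_examples: List[str] = field(default_factory=list)


def map_term(bg_lemma, bg_definition, bg_examples=()):
    """Translate the lemma, look up English synsets, and build candidate links."""
    links = []
    for en_word in TRANSLATIONS.get(bg_lemma, []):
        for synset_id in EN_SYNSETS.get(en_word, []):
            links.append(
                SynsetLink(bg_lemma, synset_id, "equivalent",
                           bg_definition, list(bg_examples))
            )
    return links


candidates = map_term("прах", "(definition text)", ["(example sentence)"])
```

A lemma with no entry in the translation table yields an empty list, which corresponds to a term that still needs manual treatment.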


Mapping with Wikipedia

Two types of extension of BTB-WN were planned: adding new senses to existing lemmas and adding instances.

For the first task, all lemmas in BTB-WN were compared with the titles of Bulgarian Wikipedia articles. The Wikipedia senses that were missing from BTB-WN were added to the wordnet with definitions and links to Wikipedia.

The titles of the corresponding English Wikipedia articles were also extracted and used to select the right sense in English and thus an appropriate synset in the OEW for mapping.
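The comparison of lemmas with article titles can be sketched as follows. This is a minimal illustration, not the procedure actually used for BTB-WN: the function names `base_title` and `candidate_senses` are hypothetical, the title list is made-up toy input, and the sketch assumes that titles may carry a parenthetical qualifier, as Wikipedia titles often do for disambiguation.

```python
import re
from collections import defaultdict

# Matches a trailing parenthetical qualifier such as " (филм)".
_QUALIFIER = re.compile(r"\s*\([^)]*\)\s*$")


def base_title(title):
    """Strip a trailing qualifier and lowercase the remaining title."""
    return _QUALIFIER.sub("", title).strip().lower()


def candidate_senses(lemmas, titles):
    """Group article titles under the lemma they match after normalisation."""
    lemma_set = {lemma.lower() for lemma in lemmas}
    found = defaultdict(list)
    for title in titles:
        base = base_title(title)
        if base in lemma_set:
            found[base].append(title)
    return dict(found)


# Toy input only; these titles are illustrative, not a claim about Wikipedia.
matches = candidate_senses(
    ["прах", "анали"],
    ["Прах", "Прах (филм)", "София"],
)
```

Each matched title is only a candidate sense; deciding whether it is really missing from the wordnet, and writing its definition, remains an editorial step.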


Mapping with DBpedia

The mapping with DBpedia (an open knowledge graph with information from Wikimedia) was used for the second type of BTB-WN extension, the addition of instances.

Named entities in BulTreeBank are annotated with URIs from DBpedia, and because the Bulgarian DBpedia is relatively small, Bulgarian Wikipedia was also used.

So far the mapping has been done for the three most frequent types of named entities in BulTreeBank: people, locations and organisations.
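One possible reading of this instance mapping is sketched below. It rests on several assumptions: the `Instance` record, the `build_instances` function, the type labels and the fallback table are hypothetical, and treating a Wikipedia link as a fallback when no DBpedia URI is available is an interpretation of the text above, not a documented rule.

```python
from dataclasses import dataclass

# The three entity types mentioned above; the label strings are assumptions.
SUPPORTED_TYPES = {"person", "location", "organisation"}


@dataclass
class Instance:
    name: str
    entity_type: str
    uri: str


def build_instances(annotations, wikipedia_fallback):
    """Keep supported entity types and attach a DBpedia URI or a Wikipedia link."""
    instances = []
    for name, entity_type, dbpedia_uri in annotations:
        if entity_type not in SUPPORTED_TYPES:
            continue  # type not covered by the mapping so far
        uri = dbpedia_uri or wikipedia_fallback.get(name)
        if uri is None:
            continue  # no reference available for this entity
        instances.append(Instance(name, entity_type, uri))
    return instances


# Toy annotations; "Example Org" and its link are placeholders.
annotations = [
    ("София", "location", "http://dbpedia.org/resource/Sofia"),
    ("Example Org", "organisation", None),
    ("2018", "date", None),
]
fallback = {"Example Org": "https://bg.wikipedia.org/wiki/Example_Org"}
instances = build_instances(annotations, fallback)
```

Entities of other types, or without any available reference, are simply skipped in this sketch.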

Sources

Here we list the main sources that we consult during the creation of BTB-WN to determine the possible senses, to formulate definitions and examples, and to map to Princeton WordNet and Open English Wordnet.

  1. Multivolume dictionary of Bulgarian language
  2. L. Andrejčin, et al. Bulgarian explanatory dictionary. IV edition, supplemented and revised by D. Popov. Nauka i izkustvo, 1994
  3. E. Perniška, D. Blagoeva and S. Kolkovska. Dictionary of the new words in Bulgarian language, Nauka i izkustvo, 2010, 2021
  4. D. Blagoeva and S. Kolkovska. Dictionary of the new words in Bulgarian language, Nauka i izkustvo, 2021
  5. I. Kasabov and K. Stojanov. Universal encyclopaedic dictionary, Svidas, 1999, 2003
  6. A. Nanova. Bulgarian synonymy and antonymy dictionary with idioms, Prosveta, 2019
  7. Online dictionary
  8. Princeton WordNet (PWN)
  9. Open English Wordnet (OEW)
  10. Wikipedia
  11. Wiktionary 

While creating BTB-WN, we also consult concordances over several Bulgarian corpora:

  1. Bulgarian HPSG-based TreeBank
  2. Bulgarian National Reference Corpus - BulTreeBank
  3. CLaDA-BG Multi Billion Corpus 

EU Context and Financial Support
