The creation of BulTreeBank WordNet has a long history. It began as a set of Bulgarian domain lexicons aligned to domain and upper ontologies, used in the European projects LT4EL and AsIsKnown. The utility of these lexicons for semantic annotation of domain texts and for other NLP tasks motivated us to start our own work on a Bulgarian WordNet.
When the CLaDA-BG project started at the end of 2018, BTB-WN contained a little more than 19000 synsets. During the CLaDA-BG project (2019, 2020, 2021 and the beginning of 2022), BTB-WN was checked for consistency by two people, the definitions were improved, it was mapped to Bulgarian Wikipedia, and new examples were added for many synsets.
Thus, the current version of BTB-WN - 4.0 - was thoroughly revised using specialised software and extended with more than 10 000 new senses, so it currently contains more than 30 000 synsets.
Several explanatory dictionaries were consulted to check the number of word senses, the definitions, etc. in BTB-WN.
Version 3.0 of BTB-WN contained about 12500 synsets. Both versions were created and used within the QTLeap European project. After the QTLeap project the vocabulary of BTB-WN was extended on the basis of a frequency list of lemmas from the Bulgarian National Reference Corpus - BulTreeBank.
Version 2.0 of BTB-WN contained about 9000 synsets. The wordnet was extended by sense extension, which includes two activities:
- detection of the missing senses of processed lemmas in BulTreeBank and adding them to the BTB-WN
- a semi-automatic extraction of information from the Bulgarian Wiktionary mapped to synsets from PWN and then manually checked.
After checking, a little more than 5000 of them were approved and added to BTB-WN. We would like to thank Antoni Oliver Gonzalez, who provided the automatic mapping from Bulgarian Wiktionary to PWN. Besides this extension, we added new senses for words that were already included in synsets of BTB-WN. The idea is for each word to be represented with all its senses.
BulTreeBank WordNet (BTB-WN 1.0) was created by manual translation into Bulgarian of English synsets from the Core WordNet subset of Princeton WordNet (the 5000 most frequently used word senses).
This step ensures comparable coverage between the two WordNets on the most frequent senses. The translation was done by two people with excellent knowledge of English. First, they formulated a Bulgarian definition reflecting the concept represented by the corresponding English synset. Then they formed the Bulgarian synset by recording the Bulgarian lemmas that have this meaning. Some of the lemmas might be multiword expressions.
After this first phase a lexicographer checked both the definition and the lemmas. The result of this work was published as part of the Open Multilingual WordNet under the CC BY 3.0 licence.
Version 1.0 of BTB-WN was extended by identifying the senses used in the Bulgarian treebank BulTreeBank (BTB). The identified senses have been organised in synsets for the BulTreeBank WordNet. The newly created Bulgarian synsets are being mapped onto the conceptual structure of PWN. In this way, BTB-WN was extended with real usages of the word meanings in texts. Also, the coverage of the core and base concepts of Princeton WordNet has been evaluated over a Bulgarian syntactic corpus.
A synset is a structure for the lexical entries in WordNet consisting of a set of synonyms related to the same sense, a definition of the sense, and examples of usage of the synonyms.
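The three components of a synset named above can be pictured as a small record. The following is only a minimal sketch: the class name `Synset`, its field names, and the placeholder definition and example are assumptions made for illustration and do not reflect the actual storage format of BTB-WN.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Synset:
    """Hypothetical record: synonyms sharing one sense, a definition, examples."""
    synonyms: List[str]
    definition: str
    examples: List[str] = field(default_factory=list)

    def has_lemma(self, lemma: str) -> bool:
        # A lemma belongs to the synset if it is listed among its synonyms.
        return lemma in self.synonyms


# Illustrative entry only; the definition and example are placeholders
# written for this sketch, not text taken from BTB-WN.
chair = Synset(
    synonyms=["стол", "столче"],
    definition="placeholder definition: a seat for one person",
    examples=["placeholder example sentence"],
)
```

A call such as `chair.has_lemma("столче")` then simply checks membership in the synonym set.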
Nouns that are defective with respect to the number category, i.e. used only in the singular (прах, “prah”, dust) or only in the plural (анали, “anali”, annals), always have only singular lemmas or only plural lemmas, respectively. The information about their usage is presented in a lemma marker.
- Professions, roles, titles and ranks for men and women are united in one synset, which has equivalent-to relation with the EWN synset for the noun for men and a near-equivalent-to relation to the EWN synset for women if such is present. Here is an example: сервитьор, “servitjor”, waiter and сервитьорка, “servitjorka”, waitress are in one synset that has equivalent-to relation with waiter and near-equivalent-to with waitress.
- Another approach is taken towards the nouns for male and female relatives and for male and female animals - they belong to different synsets. Here is an example: баща, “bašta”, father and майка, “majka”, mother are in separate synsets and each has equivalent-to relation with respectively father and mother from EWN. Both баща and майка have the synset родител, “roditel”, parent as a hypernym.
- Nouns for young animals are presented as synset members of the general meaning of the given animal. Here is an example: овен, “oven”, ram, овца, “ovca”, sheep and агне, “agne”, lamb are in one synset with equivalent-to relation with sheep and near-equivalent-to relation with ram and lamb.
- Nouns for male and female title and rank holders are members of one synset. Here is an example: дон, “don”, Don and доня, “donja”, донa, “dona”, Dona are in one synset with equivalent-to relation with Don and near-equivalent-to relation with
- Forms of addressing men and women are in one synset. Here is an example: батко, “batko” (used for addressing older men) and кака, “kaka” (used for addressing older women) are in one synset which has обръщение, “obrǎštenie” (address) as a hypernym.
An exception is made for господин, “gospodin”, Mister and госпожа, “gospoža”, Mrs., which are in separate synsets and each has equivalent-to relation respectively with and Mrs. Both господин and госпожа have обръщение, “obrǎštenie” (address) as a hypernym.
- Diminutives are members in the synset of the general form of the given word. Here is an example: стол, “stol”, chair and столче, “stolče”, chair-diminutive are in one synset.
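The relation conventions in the list above can be sketched as data. This is a hedged illustration, not the BTB-WN data model: the class `MappedSynset`, the method `link`, and the use of plain English words in place of real synset identifiers are all assumptions of this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MappedSynset:
    """Hypothetical Bulgarian synset with relations to English synsets."""
    lemmas: List[str]
    # relation name -> English synset labels (plain strings in this sketch)
    relations: Dict[str, List[str]] = field(default_factory=dict)

    def link(self, relation: str, english: str) -> None:
        # Record one relation from this synset to an English synset.
        self.relations.setdefault(relation, []).append(english)


# Professions: the male and female nouns share one synset.
waiter = MappedSynset(lemmas=["сервитьор", "сервитьорка"])
waiter.link("equivalent-to", "waiter")
waiter.link("near-equivalent-to", "waitress")

# Relatives: separate synsets, each with its own equivalent-to relation.
father = MappedSynset(lemmas=["баща"])
father.link("equivalent-to", "father")
mother = MappedSynset(lemmas=["майка"])
mother.link("equivalent-to", "mother")
```

The contrast between the two cases is visible in the data: the profession synset carries two relations to two English synsets, while each relative synset carries exactly one.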
This category also includes participles that function as adjectives, under one of two conditions: the participle either has its own entry in the dictionaries or is given as a synonym of an adjective.
Here is an example: a few participles are added as synonyms to the synset with дебел, “debel”, fat: охранен, “ohranen”, and угоен, “ugoen”, from the verbs охраня, “ohranya”, and угоя, “ugoya” (make a person/animal gain weight with a rich diet), respectively, and хранен, “hranen”, from храня, “hranja”, feed.
Ordinal numerals (for example, трети, “treti”, third) are assigned to the adjective category in BTB-WN, as is done in the OEW.
Both types of Bulgarian adjectives - qualitative (which express intrinsic properties and qualities of an object, for example красив, “krasiv”, beautiful) and relative (which reflect qualities and properties of objects in relation to another object, for example правоъгълен, “pravoǎgǎlen”, rectangular) - are included in BTB-WN.
Both types of Bulgarian adverbs are presented in BTB-WN - regular (derived from nouns, adjectives, numerals, verbs, prepositions, for example бързо, “bǎrzo”, quickly) and pronominal adverbs (derived from pronouns, for example тук, “tuk”, here).
Adverbs from all semantic types are included in BTB-WN - qualitative, quantitative, purpose, locative, temporal, etc.
Impersonal verbs have lemmas in third person singular. Here is an example: оказва се, “okazva se”, окаже се, “okaže se”, turn out, prove, turn up are in third person singular.
Prefixed verbs that express the beginning, end, duration, etc. of the action are members of the synset of the general form of the given verb.
Here is an example: чета, “četa”, read is in one synset with зачета, “začeta”, зачитам, “začitam”, start to read, попрочета, “popročeta”, попрочитам, “popročitam”, read a little, partly, пречета, “prečeta”, пречитам, “prečitam”, read again and so on.
Pronouns, prepositions, conjunctions, particles and interjections are being considered for addition in the future.
Open English Wordnet (OEW)
BTB-WN was mapped first to the Princeton WordNet 3.0 and later also to the OEW.
The mapping process starts with the translation of a Bulgarian term into English, followed by a search for the corresponding English synset, the establishment of a relation between the two synsets, and the addition of a Bulgarian definition and examples.
Since 2020 BTB-WN has been mapped to the OEW; the main benefit of this mapping is that the OEW, unlike PWN, is being updated, edited and expanded.
Two types of extension of BTB-WN were intended - extension of the existing lemmas with new senses and extension with instances.
For the first task all the lemmas in BTB-WN were compared with the titles of Bulgarian Wikipedia articles, and the senses from Wikipedia that were missing in BTB-WN were added to the wordnet with definitions and links to Wikipedia.
The titles of the corresponding English Wikipedia articles were also extracted and used for the selection of the right sense in English and, thus, of an appropriate synset in EWN for mapping.
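The comparison of lemmas with Wikipedia article titles can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: the function `candidate_senses`, the convention that a title may carry a parenthesised qualifier, and the sample titles are all hypothetical. The sketch only collects candidate titles per lemma; deciding which of them denote senses missing from the wordnet remains a manual step.

```python
import re
from typing import Dict, Iterable, List, Set

# Assumed title shape: "Lemma" or "Lemma (qualifier)".
TITLE_RE = re.compile(r"^(?P<lemma>.+?)(?:\s+\((?P<qualifier>[^()]+)\))?$")


def candidate_senses(lemmas: Set[str], titles: Iterable[str]) -> Dict[str, List[str]]:
    """Group article titles by the wordnet lemma they appear to describe."""
    candidates: Dict[str, List[str]] = {}
    for title in titles:
        match = TITLE_RE.match(title)
        if match is None:
            continue
        # Compare case-insensitively: titles start with a capital letter.
        lemma = match.group("lemma").lower()
        if lemma in lemmas:
            candidates.setdefault(lemma, []).append(title)
    return candidates


# Hypothetical sample data, invented for this sketch.
sample_lemmas = {"ключ", "стол"}
sample_titles = ["Ключ", "Ключ (музика)", "Стол", "София"]
found = candidate_senses(sample_lemmas, sample_titles)
```

Each candidate title would then be checked by hand against the senses already present for that lemma before a new synset is created.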
The mapping with DBpedia (an open knowledge graph with information from Wikimedia) was used for the second type of BTB-WN extension - with instances.
Named entities in BulTreeBank are annotated with URIs from DBpedia and because Bulgarian DBpedia is relatively small, Bulgarian Wikipedia was also used.
So far the mapping is done with the three most frequent types of Named entities in the BulTreeBank - people, locations and organisations.
Here we list the main sources that we consult during the creation of BTB-WN to determine the possible senses, formulate definitions and examples, and map to the English Princeton WordNet and the English Open WordNet.
- Multivolume dictionary of Bulgarian language
- L. Andrejčin, et al. Bulgarian explanatory dictionary. IV edition, supplemented and revised by D. Popov. Nauka i izkustvo, 1994
- E. Perniška, D. Blagoeva and S. Kolkovska. Dictionary of the new words in Bulgarian language, Nauka i izkustvo, 2010, 2021
- D. Blagoeva and S. Kolkovska. Dictionary of the new words in Bulgarian language, Nauka i izkustvo, 2021
- I. Kasabov and K. Stojanov. Universal encyclopaedic dictionary, Svidas, 1999, 2003
- A. Nanova. Bulgarian synonymy and antonymy dictionary with idioms, Prosveta, 2019
- Online dictionary
- English Princeton WordNet
- English Open WordNet
In the process of creating BTB-WN we also consult concordances over several Bulgarian corpora:
- Bulgarian HPSG-based TreeBank
- Bulgarian National Reference Corpus - BulTreeBank
- CLaDA-BG Multi Billion Corpus