Corpora

Latest versions of resources: https://universaldependencies.org/

Reference corpus of Bulgarian – BulTreeBank (web concordance)

Corpus of Parliamentary and Journalistic Speech (web concordance)

CHILDES Bulgarian LabLing Corpus
DOI: 10.21415/PHWH-J834
The Bulgarian LabLing Corpus, the first corpus of Bulgarian children's speech, has been published on the CHILDES (Child Language Data Exchange System) platform.
The corpus was created by researchers at LABLING, the laboratory of applied linguistics at “Episkop Konstantin Preslavski” University, Shumen; the laboratory is a technological partner in the national CLaDa-BG project.

Bulgarian Treebank Corpus
ISLRN: 761-430-854-533-2
The Bulgarian Treebank Corpus comprises 156,149 tokens (11,138 sentences) drawn from three main sources: Grammar Notebooks (1,391 sentences), News (6,698 sentences), and Other (3,049 sentences). It is available with sentence-level syntactic and morphological annotation in Universal Dependencies format.
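
Universal Dependencies treebanks are distributed as CoNLL-U files: one token per line with tab-separated fields, comment lines starting with "#", and a blank line after each sentence. As a minimal sketch, assuming the corpus has been downloaded as a CoNLL-U file, sentences and syntactic words could be counted as follows. The function name count_conllu and the two-sentence sample are hypothetical and not taken from the corpus; the totals it yields may differ from the published token count depending on how multiword tokens are counted.

```python
def count_conllu(text):
    """Return (sentences, words) for a string in CoNLL-U format."""
    sentences = 0
    words = 0
    in_sentence = False
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if in_sentence:
                sentences += 1
                in_sentence = False
            continue
        if line.startswith("#"):
            # Sentence-level comment such as "# text = ...".
            continue
        token_id = line.split("\t", 1)[0]
        # Count only plain integer IDs; skip multiword-token ranges
        # ("1-2") and empty nodes ("1.1").
        if token_id.isdigit():
            words += 1
        in_sentence = True
    if in_sentence:
        # The file did not end with a blank line.
        sentences += 1
    return sentences, words


# Made-up sample for illustration only.
SAMPLE = (
    "# sent_id = demo-1\n"
    "1\tA\ta\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\ttest\ttest\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "# sent_id = demo-2\n"
    "1-2\tdon't\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "1\tdo\tdo\tAUX\t_\t_\t0\troot\t_\t_\n"
    "2\tn't\tnot\tPART\t_\t_\t1\tadvmod\t_\t_\n"
)

if __name__ == "__main__":
    print(count_conllu(SAMPLE))
```

For a real file, the text can be read with open(path, encoding="utf-8").read(), where path is the local location of the downloaded treebank.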

Bulgarian Event Corpus
ISLRN: 832-960-876-604-2
The Bulgarian Event Corpus comprises 324,905 tokens and is suited to training Named Entity Recognition (NER), Named Entity Linking (NEL), and Event Recognition models for Bulgarian across multiple domains within the Humanities. The texts come from the Social Sciences and Humanities: scientific papers, archive documents, popular documents, and Wikipedia articles in the relevant areas.

Parliamentary Corpora from Phase 1 of the ParlaMint project
Multilingual comparable corpora of parliamentary debates ParlaMint 4.0: https://www.clarin.eu/parlamint#parlamint-corpora

Kontext

European Context and Financial Support