Three freely available Large Language Models for Bulgarian, following the BERT architecture (Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR, abs/1810.04805), have been uploaded to the Hugging Face platform:

BERT-Base (109 million parameters) –
https://huggingface.co/AIaLT…/bert_bg_lit_web_base_uncased

BERT-Large (334 million parameters) –
https://huggingface.co/AIaLT…/bert_bg_lit_web_large_uncased

BERT-Extra Large (657 million parameters) –
https://huggingface.co/…/bert_bg_lit_web_extra_large_uncased
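
For a quick start, the sketch below shows how such a model can be loaded with the Hugging Face transformers library and probed with a masked-token prediction. The repository identifier is a placeholder, since the organization name is abbreviated in the links above; substitute the exact model ID from the corresponding link.

```python
# Minimal sketch: load one of the Bulgarian BERT models and run a
# fill-in-the-blank (masked language modelling) probe.
# NOTE: the repository ID below is a placeholder; use the exact ID from the
# Hugging Face links listed above.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_ID = "<organisation>/bert_bg_lit_web_base_uncased"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)
model.eval()

# The models are uncased, so the input is given in lower case.
text = f"софия е {tokenizer.mask_token} на българия."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Position of the [MASK] token and its most probable replacement.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```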

These models can be fine-tuned for various classification tasks. Within CLaDA-BG in particular, they have been fine-tuned for the basic tasks in processing Bulgarian texts: grammatical annotation, lemmatization, named entity recognition, dependency parsing, word sense annotation, and sentence segmentation. The main application of this processing within CLaDA-BG is the extraction of knowledge from large text collections.
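
As an illustration of this fine-tuning workflow, the following sketch adapts one of the models to a token-classification task such as named entity recognition, using the transformers and datasets libraries. The repository ID, label set, and toy training data are illustrative placeholders and do not reproduce the actual CLaDA-BG pipelines.

```python
# Sketch of fine-tuning a Bulgarian BERT model for token classification
# (e.g. named entity recognition). Names marked as placeholders below are
# assumptions for illustration, not the CLaDA-BG setup.
from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

MODEL_ID = "<organisation>/bert_bg_lit_web_base_uncased"  # placeholder
LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]        # illustrative tag set

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID, num_labels=len(LABELS))

# Toy dataset: pre-tokenized sentences with one label per word.
raw = Dataset.from_dict({
    "tokens": [["иван", "живее", "в", "софия", "."]],
    "tags":   [[1, 0, 0, 3, 0]],  # B-PER O O B-LOC O
})

def encode(example):
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    # Align word-level tags with subword tokens: label only the first subword
    # of each word; special tokens and continuation pieces get -100 (ignored).
    aligned, previous = [], None
    for word_id in enc.word_ids():
        if word_id is None or word_id == previous:
            aligned.append(-100)
        else:
            aligned.append(example["tags"][word_id])
        previous = word_id
    enc["labels"] = aligned
    return enc

train_set = raw.map(encode, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bg_token_classifier", num_train_epochs=3),
    train_dataset=train_set,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```

The same pattern carries over to the other tasks listed above by changing the label set and training data; for sequence-level tasks, AutoModelForSequenceClassification would be used instead.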

Below we offer two articles that present specific applications of these models to text analysis. One introduces shallow syntactic information into graph-based dependency parsing, and the other deals with linking words in the text to their senses:

Paev, N., Simov, K., & Osenova, P. (2024). Introducing Shallow Syntactic Information within the Graph-based Dependency Parsing. In Proceedings of TLT 2024, Hamburg, Germany. https://aclanthology.org/2024.tlt-1.6/

Paev, N., Simov, K., & Osenova, P. (2025). Word Sense Disambiguation with Large Language Models: Casing Bulgarian. In Global WordNet Conference 2025. [pdf]