|
---
datasets:
- mc4
language:
- ka
library_name: transformers
tags:
- general
widget:
- text: "ქართული [MASK] სწავლა საკმაოდ რთულია"
  example_title: "Georgian Language"
- text: "საქართველოს [MASK] ნაკრები ერთა ლიგაზე კარგად ასპარეზობს"
  example_title: "Football"
- text: "ქართული ღვინო განთქმულია [MASK] მსოფლიოში"
  example_title: "Wine"
---
|
|
|
# General Georgian Language Model |
|
|
|
This is a pretrained language model for Georgian. It is based on the DistilBERT-base-uncased architecture and was trained on the Georgian portion of the mC4 dataset, a large collection of Georgian-language web documents.
|
|
|
## Model Details |
|
|
|
- **Architecture**: DistilBERT-base-uncased
- **Pretraining Corpus**: mC4 (multilingual Colossal Clean Crawled Corpus), Georgian subset
- **Language**: Georgian (ka)
|
|
|
## Pretraining |
|
|
|
The model was pretrained using the DistilBERT architecture, a distilled version of the original BERT model that is smaller and faster at inference while retaining most of BERT's performance.
|
|
|
During pretraining, the model was exposed to a large amount of preprocessed Georgian text from the mC4 dataset, using the masked-language-modeling (MLM) objective.
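
The exact pretraining script is not published here, but the sketch below shows how masked-language-modeling inputs are typically prepared with the `transformers` library's `DataCollatorForLanguageModeling`. The example sentence and the 15% masking rate are illustrative assumptions, not details taken from this model's training run.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Load this model's tokenizer
tokenizer = AutoTokenizer.from_pretrained("Davit6174/georgian-distilbert-mlm")

# Randomly replace ~15% of tokens with [MASK] (the standard MLM setting)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# "Georgia is a beautiful country" -- an arbitrary example sentence
encoding = tokenizer("საქართველო ლამაზი ქვეყანაა")
batch = collator([encoding])

# input_ids now contain [MASK] tokens; labels keep the original ids at the
# masked positions and -100 (ignored by the loss) everywhere else
print(tokenizer.decode(batch["input_ids"][0]))
```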
|
|
|
## Usage |
|
|
|
The General Georgian Language Model can be applied to a range of natural language processing (NLP) tasks, such as:

- Text classification
- Named entity recognition
- Sentiment analysis
- Masked-token prediction (fill-mask)

You can fine-tune the model on task-specific datasets for these downstream tasks, or use it as a feature extractor for transfer learning, as sketched below.
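
As a rough illustration of feature extraction, the following sketch loads the base model without the MLM head and takes the final hidden states as features; the example sentence and the choice of [CLS] pooling are illustrative assumptions.

```python
from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("Davit6174/georgian-distilbert-mlm")
model = TFAutoModel.from_pretrained("Davit6174/georgian-distilbert-mlm")

# "Georgian wine is very tasty" -- an arbitrary example sentence
inputs = tokenizer("ქართული ღვინო ძალიან გემრიელია", return_tensors="tf")
outputs = model(**inputs)

# Token embeddings of shape (batch_size, sequence_length, hidden_size);
# the first ([CLS]) vector is a common choice of sentence-level feature
features = outputs.last_hidden_state
cls_embedding = features[:, 0, :]
```

These embeddings can feed any downstream classifier; alternatively, you can fine-tune end to end with a task head (for example `TFAutoModelForSequenceClassification`).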
|
|
|
## Example Code |
|
|
|
Here is an example of using the General Georgian Language Model for masked-token prediction with the Hugging Face `transformers` library in Python:
|
|
|
```python
from transformers import AutoTokenizer, TFAutoModelForMaskedLM, pipeline

# Load the tokenizer and the model with its masked-language-modeling head
# (the base TFAutoModel class has no head and cannot fill masks)
tokenizer = AutoTokenizer.from_pretrained("Davit6174/georgian-distilbert-mlm")
model = TFAutoModelForMaskedLM.from_pretrained("Davit6174/georgian-distilbert-mlm")

# Build the fill-mask pipeline
mask_filler = pipeline("fill-mask", model=model, tokenizer=tokenizer)

text = "ქართული [MASK] სწავლა საკმაოდ რთულია"

# Generate predictions for the masked token (top 5 by default)
preds = mask_filler(text)

for pred in preds:
    print(f">>> {pred['sequence']}")
```
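
Each prediction returned by the pipeline is a dictionary containing the candidate `token`, its decoded `token_str`, a confidence `score`, and the full filled-in `sequence`, so you can inspect more than just the text printed above.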
|
|
|
## Limitations and Considerations |
|
- The model's performance may vary across different downstream tasks and domains. |
|
- The model's understanding of context and nuanced meanings may not always be accurate. |
|
- The model may generate plausible-sounding but incorrect or nonsensical Georgian text. |
|
- It is therefore recommended to evaluate the model on your target task and, when necessary, fine-tune it on task-specific data.
|
|
|
## Acknowledgments |
|
The Georgian Language Model was pretrained with the Hugging Face `transformers` library on the community-maintained mC4 dataset. I would like to express my gratitude to the contributors and maintainers of these valuable resources.
|
|