---
language:
- en
- kn
tags:
- language
- kannada
license: mit
base_model:
- text2font/ByteLevelBPETokenizer_default
---

# Kannada ByteLevel BPE Tokenizer

This repository contains a **Byte-Level BPE (Byte Pair Encoding) tokenizer** for the **Kannada** language, designed using the Hugging Face `tokenizers` library. The tokenizer is optimized for handling Kannada text, including **pure Kannada, mixed-language text, numbers, punctuation, and special cases like URLs and emojis**.

## Features

- **ByteLevel BPE Tokenization**: Ensures efficient and compact subword representation.
- **Special Tokens Support**: Includes special tokens for sentence boundaries, padding, and masking.
- **Post-processing similar to BERT**: Helps structure input data effectively for downstream models.
- **Handles Kannada and mixed-language text**: Can tokenize Kannada-English text seamlessly.

## Usage

This tokenizer is pre-trained on a diverse Kannada text corpus and can be used for a variety of NLP tasks such as text classification, language modeling, and machine translation.
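
A minimal loading and round-trip sketch with the `tokenizers` library is shown below. The `kannada_tokenizer/tokenizer.json` path is assumed from the repository structure further down; adjust it to wherever you place the files.

```python
from tokenizers import Tokenizer

# Load the serialized tokenizer (path assumed from the repository layout).
tokenizer = Tokenizer.from_file("kannada_tokenizer/tokenizer.json")

text = "ನಮಸ್ಕಾರ ಕನ್ನಡ ಭಾಷೆ"
encoding = tokenizer.encode(text)

print(encoding.tokens)  # byte-level subword tokens wrapped in <s> ... </s>
print(encoding.ids)     # corresponding token IDs

# Decoding should reconstruct the original text, as in the test results below.
decoded = tokenizer.decode(encoding.ids, skip_special_tokens=True)
print(decoded)
```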

## Training Details

The tokenizer was trained on a large Kannada corpus, including Wikipedia articles, literary texts, and online content, ensuring broad vocabulary coverage.
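
The exact training script is not included in this repository, so the following is only a sketch of how a ByteLevel BPE tokenizer with BERT-like post-processing is typically built with the `tokenizers` library. The corpus file names, vocabulary size, and minimum frequency are placeholders rather than the values actually used; `<s>` (ID 0) and `</s>` (ID 1) match the test results below, while the remaining special-token names are assumptions.

```python
import os

from tokenizers import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

# Placeholder corpus files and hyperparameters -- not the values used for this release.
corpus_files = ["kannada_wikipedia.txt", "kannada_literature.txt", "kannada_web.txt"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=corpus_files,
    vocab_size=2000,
    min_frequency=2,
    special_tokens=["<s>", "</s>", "<pad>", "<unk>", "<mask>"],
)

# BERT-like post-processing: wrap every encoded sequence in sentence-boundary
# tokens, matching the <s> ... </s> wrapping visible in the test results below.
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)

# Save vocab.json/merges.txt plus the single serialized tokenizer.json.
os.makedirs("kannada_tokenizer", exist_ok=True)
tokenizer.save_model("kannada_tokenizer")
tokenizer.save("kannada_tokenizer/tokenizer.json")
```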

## Supported Test Cases

The tokenizer has been tested on multiple text categories (a script for reproducing the per-case analysis is sketched after this list):

- **Pure Kannada**: Simple and complex Kannada sentences.
- **Mixed Language**: Kannada-English hybrid text.
- **Numbers and Dates**: Kannada and Western numerals, date formats, and currency symbols.
- **Punctuation Handling**: Sentences with punctuation and special characters.
- **Special Cases**: URLs, hashtags, emojis, and file paths.
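
The analysis fields reported for each case below (number of tokens, average token length, reconstruction) can be regenerated with a small harness along these lines. The file path is assumed, and average token length is computed here as characters of input text per emitted token (including special tokens), which matches the reported figures.

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("kannada_tokenizer/tokenizer.json")  # path assumed

samples = {
    "Pure Kannada": "ನಮಸ್ಕಾರ ಕನ್ನಡ ಭಾಷೆ",
    "Mixed Language": "ನನ್ನ email ID ಇದು example@email.com ಆಗಿದೆ",
}

for category, text in samples.items():
    enc = tokenizer.encode(text)
    decoded = tokenizer.decode(enc.ids, skip_special_tokens=True)
    print(f"Category: {category}")
    print("Encoded tokens:", enc.tokens)
    print("Token IDs:", enc.ids)
    print("Number of tokens:", len(enc.tokens))
    # Average token length = input characters per emitted token.
    print("Average token length:", round(len(text) / len(enc.tokens), 2))
    print("Reconstruction:", "Perfect" if decoded.strip() == text else "Lossy")
```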

## Tokenizer Test Results

### Category: Pure Kannada

#### Test Case 1: Basic sentence

**Original text:** ನಮಸ್ಕಾರ ಕನ್ನಡ ಭಾಷೆ

**Encoded tokens:** `['<s>', 'ನಮಸ', 'à³į', 'à²ķ', 'ಾ', 'ರ', 'Ġà²ķನ', 'à³į', 'ನಡ', 'Ġà²Ń', 'ಾ', 'ಷ', 'à³Ĩ', '</s>']`

**Token IDs:** `[0, 1461, 264, 278, 270, 272, 738, 264, 407, 386, 270, 323, 268, 1]`

**Decoded text:** ನಮಸ್ಕಾರ ಕನ್ನಡ ಭಾಷೆ

**Analysis:**

- **Number of tokens:** 14
- **Average token length:** 1.29 characters
- **Reconstruction:** Perfect

#### Test Case 2: Complex sentence

**Original text:** ಕನ್ನಡ ನಾಡಿನ ಸಂಸ್ಕೃತಿ ಮತ್ತು ಪರಂಪರೆ

**Encoded tokens:** `['<s>', 'à²ķನ', 'à³į', 'ನಡ', 'Ġನ', 'ಾ', 'ಡ', 'ಿ', 'ನ', 'Ġಸ', 'à²Ĥ', 'ಸ', 'à³į', 'à²ķ', 'à³ĥ', 'ತ', 'ಿ', 'Ġಮತ', 'à³į', 'ತ', 'à³ģ', 'Ġಪರ', 'à²Ĥ', 'ಪರ', 'à³Ĩ', '</s>']`

**Token IDs:** `[0, 754, 264, 407, 298, 270, 280, 267, 266, 300, 275, 281, 264, 278, 412, 271, 267, 382, 264, 271, 265, 360, 275, 524, 268, 1]`

**Decoded text:** ಕನ್ನಡ ನಾಡಿನ ಸಂಸ್ಕೃತಿ ಮತ್ತು ಪರಂಪರೆ

**Analysis:**

- **Number of tokens:** 26
- **Average token length:** 1.27 characters
- **Reconstruction:** Perfect

### Category: Mixed Language

#### Test Case 1: Kannada with English

**Original text:** ನನ್ನ email ID ಇದು example@email.com ಆಗಿದೆ

**Encoded tokens:** `['<s>', 'ನನ', 'à³į', 'ನ', 'Ġ', 'e', 'm', 'a', 'i', 'l', 'Ġ', 'I', 'D', 'Ġà²ĩದ', 'à³ģ', 'Ġ', 'e', 'x', 'a', 'm', 'p', 'l', 'e', '@', 'e', 'm', 'a', 'i', 'l', '.', 'com', 'Ġà²Ĩà²Ĺ', 'ಿ', 'ದ', 'à³Ĩ', '</s>']`

**Token IDs:** `[0, 306, 264, 266, 225, 73, 81, 69, 77, 80, 225, 45, 40, 493, 265, 225, 73, 92, 69, 81, 84, 80, 73, 36, 73, 81, 69, 77, 80, 18, 469, 408, 267, 269, 268, 1]`

**Decoded text:** ನನ್ನ email ID ಇದು example@email.com ಆಗಿದೆ

**Analysis:**

- **Number of tokens:** 36
- **Average token length:** 1.14 characters
- **Reconstruction:** Perfect

## Repository Structure

The repository consists of tokenizer files, configuration files, and documentation:

```
├── kannada_tokenizer/
│   ├── vocab.json
│   ├── merges.txt
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   ├── special_tokens_map.json
│   └── config.json
└── README.md
```
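
For use with the `transformers` library, the serialized `tokenizer.json` can be wrapped in a fast tokenizer as sketched below. The special-token names are partly assumptions: `<s>` and `</s>` appear in the test results above, while `<pad>`, `<unk>`, and `<mask>` are conventional choices and should be checked against `special_tokens_map.json`.

```python
from transformers import PreTrainedTokenizerFast

# Wrap the serialized tokenizer for use with transformers pipelines and models.
# <s> and </s> match the test results; the other special-token names are assumptions.
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="kannada_tokenizer/tokenizer.json",
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    mask_token="<mask>",
)

# Batch encoding with padding, e.g. for classification or language-modeling data.
batch = hf_tokenizer(["ನಮಸ್ಕಾರ ಕನ್ನಡ ಭಾಷೆ", "ಕನ್ನಡ ನಾಡಿನ ಸಂಸ್ಕೃತಿ ಮತ್ತು ಪರಂಪರೆ"], padding=True)
print(batch["input_ids"])
print(batch["attention_mask"])
```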

## License

This tokenizer is released under the **MIT License**.

## Citation

If you use this tokenizer, please cite this repository.

For improvements and contributions, feel free to submit a **Pull Request** or open an **Issue**. 🚀