---
language:
- en
- kn
tags:
- language
- kannada
license: mit
base_model:
- text2font/ByteLevelBPETokenizer_default
---

# Kannada ByteLevel BPE Tokenizer

This repository contains a **Byte-Level BPE (Byte Pair Encoding) tokenizer** for the **Kannada** language, designed using the Hugging Face `tokenizers` library. The tokenizer is optimized for handling Kannada text, including **pure Kannada, mixed-language text, numbers, punctuation, and special cases like URLs and emojis**.

## Features

- **ByteLevel BPE Tokenization**: Ensures efficient and compact subword representation.
- **Special Tokens Support**: Includes special tokens for sentence boundaries, padding, and masking.
- **Post-processing similar to BERT**: Helps structure input data effectively for downstream models.
- **Handles Kannada and mixed-language text**: Can tokenize Kannada-English text seamlessly.

## Usage

This tokenizer is pre-trained on a diverse Kannada text corpus and can be used for a variety of NLP tasks such as text classification, language modeling, and machine translation.
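
A minimal loading and round-trip sketch with the `tokenizers` library is shown below. The `kannada_tokenizer/tokenizer.json` path is assumed from the repository structure further down; adjust it to wherever you place the files.

```python
from tokenizers import Tokenizer

# Load the serialized tokenizer (path assumed from the repository layout).
tokenizer = Tokenizer.from_file("kannada_tokenizer/tokenizer.json")

text = "ನಮಸ್ಕಾರ ಕನ್ನಡ ಭಾಷೆ"
encoding = tokenizer.encode(text)

print(encoding.tokens)  # byte-level subword tokens wrapped in <s> ... </s>
print(encoding.ids)     # corresponding token IDs

# Decoding should reconstruct the original text, as in the test results below.
decoded = tokenizer.decode(encoding.ids, skip_special_tokens=True)
print(decoded)
```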

## Training Details

The tokenizer was trained on a large Kannada corpus, including Wikipedia articles, literary texts, and online content, ensuring broad vocabulary coverage.
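
The exact training script is not included in this repository, so the following is only a sketch of how a ByteLevel BPE tokenizer with BERT-like post-processing is typically built with the `tokenizers` library. The corpus file names, vocabulary size, and minimum frequency are placeholders rather than the values actually used; `<s>` (ID 0) and `</s>` (ID 1) match the test results below, while the remaining special-token names are assumptions.

```python
import os

from tokenizers import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

# Placeholder corpus files and hyperparameters -- not the values used for this release.
corpus_files = ["kannada_wikipedia.txt", "kannada_literature.txt", "kannada_web.txt"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=corpus_files,
    vocab_size=2000,
    min_frequency=2,
    special_tokens=["<s>", "</s>", "<pad>", "<unk>", "<mask>"],
)

# BERT-like post-processing: wrap every encoded sequence in sentence-boundary
# tokens, matching the <s> ... </s> wrapping visible in the test results below.
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)

# Save vocab.json/merges.txt plus the single serialized tokenizer.json.
os.makedirs("kannada_tokenizer", exist_ok=True)
tokenizer.save_model("kannada_tokenizer")
tokenizer.save("kannada_tokenizer/tokenizer.json")
```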

## Supported Test Cases

The tokenizer has been tested on multiple text categories (a script for reproducing the per-case analysis is sketched after this list):

- **Pure Kannada**: Simple and complex Kannada sentences.
- **Mixed Language**: Kannada-English hybrid text.
- **Numbers and Dates**: Kannada and Western numerals, date formats, and currency symbols.
- **Punctuation Handling**: Sentences with punctuation and special characters.
- **Special Cases**: URLs, hashtags, emojis, and file paths.
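
The analysis fields reported for each case below (number of tokens, average token length, reconstruction) can be regenerated with a small harness along these lines. The file path is assumed, and average token length is computed here as characters of input text per emitted token (including special tokens), which matches the reported figures.

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("kannada_tokenizer/tokenizer.json")  # path assumed

samples = {
    "Pure Kannada": "ನಮಸ್ಕಾರ ಕನ್ನಡ ಭಾಷೆ",
    "Mixed Language": "ನನ್ನ email ID ಇದು example@email.com ಆಗಿದೆ",
}

for category, text in samples.items():
    enc = tokenizer.encode(text)
    decoded = tokenizer.decode(enc.ids, skip_special_tokens=True)
    print(f"Category: {category}")
    print("Encoded tokens:", enc.tokens)
    print("Token IDs:", enc.ids)
    print("Number of tokens:", len(enc.tokens))
    # Average token length = input characters per emitted token.
    print("Average token length:", round(len(text) / len(enc.tokens), 2))
    print("Reconstruction:", "Perfect" if decoded.strip() == text else "Lossy")
```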

## Tokenizer Test Results

### Category: Pure Kannada

#### Test Case 1: Basic sentence

**Original text:** ನಮಸ್ಕಾರ ಕನ್ನಡ ಭಾಷೆ

**Encoded tokens:** `['<s>', 'ನಮಸ', 'à³į', 'à²ķ', 'ಾ', 'ರ', 'Ġà²ķನ', 'à³į', 'ನಡ', 'Ġà²Ń', 'ಾ', 'ಷ', 'à³Ĩ', '</s>']`

**Token IDs:** `[0, 1461, 264, 278, 270, 272, 738, 264, 407, 386, 270, 323, 268, 1]`

**Decoded text:** ನಮಸ್ಕಾರ ಕನ್ನಡ ಭಾಷೆ

**Analysis:**

- **Number of tokens:** 14
- **Average token length:** 1.29 characters
- **Reconstruction:** Perfect

#### Test Case 2: Complex sentence

**Original text:** ಕನ್ನಡ ನಾಡಿನ ಸಂಸ್ಕೃತಿ ಮತ್ತು ಪರಂಪರೆ

**Encoded tokens:** `['<s>', 'à²ķನ', 'à³į', 'ನಡ', 'Ġನ', 'ಾ', 'ಡ', 'ಿ', 'ನ', 'Ġಸ', 'à²Ĥ', 'ಸ', 'à³į', 'à²ķ', 'à³ĥ', 'ತ', 'ಿ', 'Ġಮತ', 'à³į', 'ತ', 'à³ģ', 'Ġಪರ', 'à²Ĥ', 'ಪರ', 'à³Ĩ', '</s>']`

**Token IDs:** `[0, 754, 264, 407, 298, 270, 280, 267, 266, 300, 275, 281, 264, 278, 412, 271, 267, 382, 264, 271, 265, 360, 275, 524, 268, 1]`

**Decoded text:** ಕನ್ನಡ ನಾಡಿನ ಸಂಸ್ಕೃತಿ ಮತ್ತು ಪರಂಪರೆ

**Analysis:**

- **Number of tokens:** 26
- **Average token length:** 1.27 characters
- **Reconstruction:** Perfect

### Category: Mixed Language

#### Test Case 1: Kannada with English

**Original text:** ನನ್ನ email ID ಇದು example@email.com ಆಗಿದೆ

**Encoded tokens:** `['<s>', 'ನನ', 'à³į', 'ನ', 'Ġ', 'e', 'm', 'a', 'i', 'l', 'Ġ', 'I', 'D', 'Ġà²ĩದ', 'à³ģ', 'Ġ', 'e', 'x', 'a', 'm', 'p', 'l', 'e', '@', 'e', 'm', 'a', 'i', 'l', '.', 'com', 'Ġà²Ĩà²Ĺ', 'ಿ', 'ದ', 'à³Ĩ', '</s>']`

**Token IDs:** `[0, 306, 264, 266, 225, 73, 81, 69, 77, 80, 225, 45, 40, 493, 265, 225, 73, 92, 69, 81, 84, 80, 73, 36, 73, 81, 69, 77, 80, 18, 469, 408, 267, 269, 268, 1]`

**Decoded text:** ನನ್ನ email ID ಇದು example@email.com ಆಗಿದೆ

**Analysis:**

- **Number of tokens:** 36
- **Average token length:** 1.14 characters
- **Reconstruction:** Perfect

## Repository Structure

The repository consists of tokenizer files, configuration files, and documentation:

```
├── kannada_tokenizer/
│   ├── vocab.json
│   ├── merges.txt
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   ├── special_tokens_map.json
│   └── config.json
└── README.md
```
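
For use with the `transformers` library, the serialized `tokenizer.json` can be wrapped in a fast tokenizer as sketched below. The special-token names are partly assumptions: `<s>` and `</s>` appear in the test results above, while `<pad>`, `<unk>`, and `<mask>` are conventional choices and should be checked against `special_tokens_map.json`.

```python
from transformers import PreTrainedTokenizerFast

# Wrap the serialized tokenizer for use with transformers pipelines and models.
# <s> and </s> match the test results; the other special-token names are assumptions.
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="kannada_tokenizer/tokenizer.json",
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    mask_token="<mask>",
)

# Batch encoding with padding, e.g. for classification or language-modeling data.
batch = hf_tokenizer(["ನಮಸ್ಕಾರ ಕನ್ನಡ ಭಾಷೆ", "ಕನ್ನಡ ನಾಡಿನ ಸಂಸ್ಕೃತಿ ಮತ್ತು ಪರಂಪರೆ"], padding=True)
print(batch["input_ids"])
print(batch["attention_mask"])
```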

## License

This tokenizer is released under the **MIT License**.

## Citation

If you use this tokenizer, please cite this repository.

For improvements and contributions, feel free to submit a **Pull Request** or open an **Issue**. 🚀