UgannA Siyabasa - FastText Sinhala Embedding Model 🇱🇰
Note: This is a demo version of the model; the final model will be released soon.
UgannA Siyabasa (උගන්නා සියබස) is the first public FastText embedding model released by Remeinium Corp. The name comes from Kumaratunga Munidasa's timeless quote:
"උගන්නා සියබස - මත් වන්නේ එහි රසයෙනි" (Learn Sinhala - be intoxicated with its beauty.)
Just as Munidasa envisioned nurturing the Sinhala language, this model is a step toward teaching it to machines.
Key Features
- Type: FastText (official library)
- Vector size: 100 dimensions
- File size: ~1.56GB
- Training data: 6.2GB processed Sinhala text
🔧 Usage
import fasttext
# Load the model
model = fasttext.load_model("Remeinium/UgannA_Siyabasa/UgannA_Siyabasa.bin")
# Get vector for a word
vector = model.get_word_vector("අම්මා")
# Get nearest neighbors
neighbors = model.get_nearest_neighbors("අම්මා", k=10)
print(neighbors)
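The vectors returned by `get_word_vector` can also be compared directly. The following is a minimal sketch of cosine similarity in plain Python. The helper name `cosine_similarity` is illustrative and not part of the fasttext API, and the toy vectors stand in for real model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(x) ** 2 for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors. With the model loaded, pass
# model.get_word_vector(word_a) and model.get_word_vector(word_b) instead.
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
```

Values close to 1 indicate vectors pointing in nearly the same direction, and values near 0 indicate unrelated directions.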
Training Data
- Processed and cleaned training corpus: ~6.2GB
- Preprocessing: tokenization, normalization, deduplication
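The exact preprocessing pipeline is not published here, so the following is only a rough sketch of what the three listed steps could look like. It assumes Unicode NFC normalization, whitespace tokenization, and line-level deduplication. The function name `preprocess` and all of these choices are assumptions, not the authors' actual method.

```python
import unicodedata

def preprocess(lines):
    """Normalize, tokenize, and deduplicate an iterable of text lines."""
    seen = set()
    cleaned = []
    for line in lines:
        # Normalization (assumed): canonical Unicode composition (NFC).
        text = unicodedata.normalize("NFC", line)
        # Tokenization (assumed): split on whitespace, dropping empty tokens.
        tokens = text.split()
        if not tokens:
            continue  # skip blank lines
        joined = " ".join(tokens)
        # Deduplication (assumed): keep only the first occurrence of a line.
        if joined in seen:
            continue
        seen.add(joined)
        cleaned.append(joined)
    return cleaned

sample = ["උගන්නා  සියබස", "උගන්නා සියබස", "", "අම්මා"]
print(preprocess(sample))
```

The output is one space-separated sentence per line, which is the plain-text input format the fastText training tools expect.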
License
This model is released under CC BY-NC 4.0 (non-commercial use). For commercial usage, please contact: [email protected]
⚠️ Limitations
- Vocabulary coverage limited to training dataset.
- May reflect cultural/linguistic biases from sources.
- Optimized for Sinhala; not multilingual (future versions will expand).
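On vocabulary coverage: FastText models generally build word vectors from character n-grams. A word missing from the training vocabulary can therefore still receive a vector, though its quality depends on how well its n-grams were covered in training. The sketch below illustrates the n-gram idea in simplified form. The helper `char_ngrams` is hypothetical, and `minn=3`, `maxn=6` are the fastText library defaults for unsupervised training; whether this model was trained with those values is an assumption.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers.

    Simplified illustration: the real library also hashes each n-gram
    into a fixed number of buckets, which is omitted here.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

# Trigrams only, to keep the printed list short.
print(char_ngrams("සියබස", minn=3, maxn=3))
```

With the model loaded, `model.get_word_id(word)` returns -1 for a word outside the vocabulary, and `model.get_subwords(word)` lists the subwords the library actually uses.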
🤝 Collaboration
You are welcome to:
- Use this model for research & personal projects
- Share improvements, benchmarks, or downstream applications
Contact: 📧 [email protected]