license: apache-2.0
datasets:
- tasal9/Pashto_Dataset
language:
- ps
- en
- ar
- ur
- fa
library_name: sentence-transformers
tags:
- multilingual
- embeddings
- semantic-search
- pashto
- chromadb
- llamaindex
- cross-lingual
- afghanistan
- zamai
pipeline_tag: feature-extraction
model-index:
- name: Multilingual-ZamAI-Embeddings
  results: []
widget:
- source_sentence: This is a sample sentence in English.
  sentences:
  - This sentence is similar to the first one.
  - دا جمله د لومړۍ جملې سره ورته ده.
  - This sentence has nothing to do with the others.
  example_title: English to multilingual similarity
- source_sentence: دا په پښتو کې یوه نمونه جمله ده.
  sentences:
  - This is a sample sentence in English.
  - دا جمله د لومړۍ جملې سره ورته ده.
  - زه د پښتو ژبې زده کړه کوم.
  example_title: Pashto to multilingual similarity
ZamAI Multilingual Embeddings
This model provides multilingual sentence embeddings with a special focus on Pashto. Built on sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2, it supports semantic search, document retrieval, and cross-lingual matching across 50+ languages.
Model Details
- Model Type: Sentence Transformer (BERT-based)
- Base Model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- Languages Supported: 50+, including Pashto (ps), English (en), Arabic (ar), Urdu (ur), and Farsi (fa)
- Max Sequence Length: 512 tokens
- Output Dimensionality: 384
- License: Apache 2.0
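Because inputs are capped at the maximum sequence length, longer documents are usually split before encoding (sentence-transformers truncates anything past the limit). The following is a minimal sketch of overlapping word-window chunking. The helper name `chunk_words`, the 200-word window, and the 40-word overlap are illustrative assumptions, not settings taken from this model, and word counts only approximate token counts because the tokenizer can split one word into several subword tokens.

```python
def chunk_words(text: str, window: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping chunks of at most `window` whitespace-separated words.

    Hypothetical helper: `window` and `overlap` are illustrative defaults chosen
    to stay well under a 512-token limit even when words split into subwords.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("require window > 0 and 0 <= overlap < window")
    words = text.split()
    step = window - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Each chunk can then be encoded separately. For an exact check, count tokens with the model's own tokenizer rather than words.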
Key Features
- Cross-lingual Understanding: Retrieve semantically similar content across different languages
- Pashto Language Support: Optimized for Pashto language processing and understanding
- Vector Database Integration: Ready-to-use with ChromaDB and LlamaIndex
- High Performance: Compact MiniLM-based encoder with 384-dimensional output, suitable for real-time applications
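When sizing a vector store, the 384-dimensional output allows a quick estimate of raw embedding storage. This sketch assumes vectors are stored as 32-bit floats (4 bytes per value) and ignores index overhead, metadata, and document text. `embedding_storage_bytes` is a hypothetical helper name.

```python
EMBEDDING_DIM = 384      # output dimensionality stated in Model Details
BYTES_PER_FLOAT32 = 4    # assumption: vectors are stored as 32-bit floats


def embedding_storage_bytes(num_vectors: int, dim: int = EMBEDDING_DIM,
                            bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Raw bytes needed to hold `num_vectors` dense vectors (no index overhead)."""
    if num_vectors < 0:
        raise ValueError("num_vectors must be non-negative")
    return num_vectors * dim * bytes_per_value


if __name__ == "__main__":
    for n in (10_000, 1_000_000):
        mib = embedding_storage_bytes(n) / 2**20
        print(f"{n:>9,} vectors -> {mib:,.1f} MiB")
```

Under these assumptions, one million embeddings need 1,536,000,000 bytes, roughly 1.5 GB before any index structures are added.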
Usage
Basic Usage with Sentence Transformers
from sentence_transformers import SentenceTransformer
from numpy import dot
from numpy.linalg import norm

# Load the model
model = SentenceTransformer('tasal9/Multilingual-ZamAI-Embeddings')

# English sentences
sentences_en = [
    "This is a sample sentence in English.",
    "This sentence is similar to the first one."
]

# Pashto sentences
sentences_ps = [
    "دا په پښتو کې یوه نمونه جمله ده.",
    "دا جمله د لومړۍ جملې سره ورته ده."
]

# Get embeddings
embeddings_en = model.encode(sentences_en)
embeddings_ps = model.encode(sentences_ps)

# Calculate cross-lingual similarity
def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

# Compare English and Pashto sentences
similarity = cosine_similarity(embeddings_en[0], embeddings_ps[0])
print(f"Cross-lingual similarity: {similarity:.4f}")
Advanced Usage with ChromaDB and LlamaIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext, VectorStoreIndex
import chromadb
# Initialize the embedding model
embed_model = HuggingFaceEmbedding(model_name="tasal9/Multilingual-ZamAI-Embeddings")
# Set up ChromaDB
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection("multilingual_collection")
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# Create an index from your documents (`documents` is a list of LlamaIndex Document objects that you supply)
# index = VectorStoreIndex.from_documents(documents, storage_context=storage_context, embed_model=embed_model)
# query_engine = index.as_query_engine()  # also needs an LLM; LlamaIndex uses its configured default
# Query in any language
# result = query_engine.query("What is the capital of Afghanistan?")
# result_ps = query_engine.query("د افغانستان پلازمېنه څه ده؟")
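For direct ChromaDB use without LlamaIndex, it can help to store a language tag with each document so that queries can later be filtered by language. The sketch below only prepares the parallel lists that ChromaDB's `collection.add` accepts (`ids`, `documents`, `metadatas`). The helper `build_records`, the `lang` metadata key, and the ID scheme are assumptions made for illustration and are not part of this model.

```python
from typing import Iterable


def build_records(docs: Iterable[tuple[str, str]], prefix: str = "doc") -> dict:
    """Turn (language_code, text) pairs into keyword arguments for collection.add.

    Hypothetical helper: the "lang" metadata key and "<prefix>-<n>" IDs are
    illustrative choices; blank texts are skipped.
    """
    ids, documents, metadatas = [], [], []
    for lang, text in docs:
        text = text.strip()
        if not text:
            continue
        ids.append(f"{prefix}-{len(ids)}")
        documents.append(text)
        metadatas.append({"lang": lang})
    return {"ids": ids, "documents": documents, "metadatas": metadatas}


records = build_records([
    ("en", "Kabul is the capital of Afghanistan."),
    ("ps", "کابل د افغانستان پلازمېنه ده."),
])
```

With a ChromaDB collection and a loaded SentenceTransformer `model`, a call such as `collection.add(**records, embeddings=model.encode(records["documents"]).tolist())` would store the vectors, and passing `where={"lang": "ps"}` to `collection.query` would restrict results to Pashto. Both calls are a usage sketch and were not run against this model.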
Performance
The model targets the following similarity settings (no benchmark scores are reported in this card):
- English-English: High semantic similarity detection
- Pashto-Pashto: Native language understanding and similarity
- Cross-lingual (English-Pashto): Strong cross-lingual semantic alignment
- Multilingual: Supports 50+ languages with consistent performance
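One way to put a number on cross-lingual alignment is top-1 retrieval accuracy over parallel sentence pairs: embed each English sentence and its Pashto translation, then check how often the nearest Pashto vector by cosine similarity is the correct translation. The sketch below is an assumed evaluation procedure, not one reported for this model. `top1_accuracy` is a hypothetical helper, and the toy 3-dimensional vectors stand in for real `model.encode` output.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top1_accuracy(src: Sequence[Vector], tgt: Sequence[Vector]) -> float:
    """Fraction of source vectors whose most similar target has the same index.

    Assumes src[i] and tgt[i] embed translations of the same sentence.
    """
    if len(src) != len(tgt) or not src:
        raise ValueError("need two non-empty lists of equal length")
    hits = 0
    for i, s in enumerate(src):
        best = max(range(len(tgt)), key=lambda j: cosine(s, tgt[j]))
        if best == i:
            hits += 1
    return hits / len(src)


# Toy stand-ins for English and Pashto embeddings (illustrative values only).
toy_en = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_ps = [[1.0, 0.0, 0.0], [0.1, 0.8, 0.1], [0.9, 0.0, 0.1]]
accuracy = top1_accuracy(toy_en, toy_ps)
```

In the toy data the third pair is deliberately misaligned, so two of three sources retrieve the right target. With real data, `src` and `tgt` would come from encoding a held-out set of English sentences and their Pashto translations.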
Applications
- Semantic Search: Find relevant documents across multiple languages
- Cross-lingual Information Retrieval: Retrieve Pashto content using English queries and vice versa
- Document Similarity: Compare documents in different languages
- Question Answering: Build multilingual QA systems
- Content Recommendation: Recommend similar content across languages
Technical Details
- Architecture: BERT-based transformer model
- Training Data: Multilingual parallel and monolingual corpora
- Optimization: Optimized for semantic similarity tasks
- Integration: Compatible with Hugging Face Transformers, Sentence Transformers, LlamaIndex, and ChromaDB
Citation
If you use this model in your research, please cite:
@misc{zamai-multilingual-embeddings-2024,
title={ZamAI Multilingual Embeddings: Cross-lingual Sentence Transformers with Pashto Support},
author={ZamAI Team},
year={2024},
url={https://huggingface.co/tasal9/Multilingual-ZamAI-Embeddings}
}
License
This model is released under the Apache 2.0 License. See the LICENSE file for details.
Contact
For questions or support, please open an issue on the model repository or contact the ZamAI team.