license: apache-2.0
datasets:
- tasal9/Pashto_Dataset
language:
- ps
- en
library_name: sentence-transformers
tags:
- multilingual
- embeddings
- semantic-search
- pashto
- chromadb
- llamaindex
pipeline_tag: feature-extraction
model-index:
- name: Multilingual-ZamAI-Embeddings
  results: []
ZamAI Multilingual Embeddings
This directory contains tools and utilities for working with multilingual embedding models, with a focus on Pashto language support. The embeddings enable semantic search, document retrieval, and other natural language processing tasks across multiple languages.
Model Information
- Base Model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- Languages Supported: 50+ including Pashto, English, Arabic, Urdu, Farsi, and more
- Vector Database: ChromaDB
- Integration Framework: LlamaIndex
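As a rough illustration of how the base model listed above can back semantic search, the sketch below embeds texts with sentence-transformers and ranks them by cosine similarity. The helper names (`embed`, `cosine_similarity`, `rank_by_similarity`) are hypothetical and are not part of the scripts in this directory; only `SentenceTransformer(...).encode(...)` comes from the library itself.

```python
import math

BASE_MODEL = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"


def embed(texts, model_name=BASE_MODEL):
    """Encode a list of texts into vectors (downloads the model on first use)."""
    # Imported lazily so the ranking helpers below work without the library.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(texts).tolist()


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar."""
    scores = [cosine_similarity(query_vec, v) for v in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
```

With the model installed, `rank_by_similarity(embed([query])[0], embed(documents))` would order `documents` by relevance to `query`, regardless of which supported language each text is written in.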
Directory Structure
embeddings/
├── setup.py # Setup script for the embeddings model and vector store
├── demo.py # Demo application with Gradio web UI
├── indexer.py # Utility for indexing new documents
├── requirements.txt # Dependencies for the embeddings components
└── chroma_db/ # Directory for the vector database (created on first run)
Getting Started
- Install the dependencies:
pip install -r models/embeddings/requirements.txt
- Add documents to index:
Place your text files in the data/text_corpus directory, then run the indexer:
python models/embeddings/indexer.py --corpus data/text_corpus/
- Run the demo application:
python models/embeddings/demo.py
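To make the indexing step more concrete, here is a minimal sketch of how text files in a corpus directory could be collected before embedding. It is not the actual logic of indexer.py; `load_corpus` and its `pattern` parameter are hypothetical, and the `*.txt` extension and UTF-8 encoding are assumptions.

```python
from pathlib import Path


def load_corpus(corpus_dir, pattern="*.txt"):
    """Return a sorted list of (file name, text) pairs, skipping empty files."""
    documents = []
    for path in sorted(Path(corpus_dir).glob(pattern)):
        # UTF-8 is assumed so Pashto and other non-Latin scripts decode correctly.
        text = path.read_text(encoding="utf-8").strip()
        if text:
            documents.append((path.name, text))
    return documents
```

Calling `load_corpus("data/text_corpus")` would then yield the documents that a later embedding and storage step could consume.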
Using the Embeddings in Your Code
from models.embeddings.setup import setup_embedding_model

# Initialize the model and related components
embedding_components = setup_embedding_model()

# Get the query engine
query_engine = embedding_components["query_engine"]

# Query in any language
result = query_engine.query("What is the capital of Afghanistan?")

# Or in Pashto
result = query_engine.query("د افغانستان پلازمېنه څه ده؟")
print(result)
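When several questions need answering in one go, a small wrapper around the query engine can help. The sketch below is only an illustration: `ask_all` is a hypothetical helper, and it assumes that converting a query result with `str()` yields the answer text, as the `print(result)` call above suggests.

```python
def ask_all(query_engine, questions):
    """Run each question through the engine and map it to the answer text."""
    answers = {}
    for question in questions:
        response = query_engine.query(question)
        # str() is assumed to yield the answer text, as print(result) suggests.
        answers[question] = str(response)
    return answers
```

For example, `ask_all(query_engine, ["What is the capital of Afghanistan?", "د افغانستان پلازمېنه څه ده؟"])` would return a dictionary keyed by question, mixing English and Pashto queries in a single call.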