---
license: apache-2.0
datasets:
- tasal9/Pashto_Dataset
language:
- ps
- en
library_name: sentence-transformers
tags:
- multilingual
- embeddings
- semantic-search
- pashto
- chromadb
- llamaindex
pipeline_tag: feature-extraction
model-index:
- name: Multilingula-ZamAI-Embeddings
  results: []
---

# ZamAI Multilingual Embeddings

This directory contains tools and utilities for working with multilingual embedding models, with a focus on Pashto language support. The embeddings enable semantic search, document retrieval, and other natural language processing tasks across multiple languages.

## Model Information

- **Base Model**: [sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2)
- **Languages Supported**: 50+ including Pashto, English, Arabic, Urdu, Farsi, and more
- **Vector Database**: ChromaDB
- **Integration Framework**: LlamaIndex

## Directory Structure

```
embeddings/
├── setup.py           # Setup script for the embeddings model and vector store
├── demo.py            # Demo application with Gradio web UI
├── indexer.py         # Utility for indexing new documents
├── requirements.txt   # Dependencies for the embeddings components
└── chroma_db/         # Directory for the vector database (created on first run)
```

## Getting Started

1. Install the dependencies:

   ```bash
   pip install -r models/embeddings/requirements.txt
   ```

2. Add documents to index:

   ```bash
   # Place your text files in the data/text_corpus directory
   python models/embeddings/indexer.py --corpus data/text_corpus/
   ```

3. Run the demo application:

   ```bash
   python models/embeddings/demo.py
   ```

## Using the Embeddings in Your Code

```python
from models.embeddings.setup import setup_embedding_model

# Initialize the model and related components
embedding_components = setup_embedding_model()

# Get the query engine
query_engine = embedding_components["query_engine"]

# Query in any language
result = query_engine.query("What is the capital of Afghanistan?")

# Or in Pashto
result = query_engine.query("د افغانستان پلازمېنه څه ده؟")
print(result)
```
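If you only need raw sentence embeddings rather than the full retrieval pipeline, the base model can be used directly with `sentence-transformers`. This is a minimal sketch using the standard `SentenceTransformer.encode` / `util.cos_sim` API; the example sentences are illustrative only.

```python
from sentence_transformers import SentenceTransformer, util

# Load the multilingual base model listed above
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Cross-lingual similarity: an English question and a Pashto answer
sentences = [
    "What is the capital of Afghanistan?",
    "د افغانستان پلازمېنه کابل ده.",  # "The capital of Afghanistan is Kabul."
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the two embeddings
print(util.cos_sim(embeddings[0], embeddings[1]).item())
```

For reference, the sketch below shows one way the stack described above (LlamaIndex + ChromaDB + the MiniLM base model) can be wired together for indexing and semantic retrieval. It is an assumption-laden illustration, not the contents of `setup.py` or `indexer.py`: the ChromaDB path, the collection name `zamai_docs`, and the `similarity_top_k` value are placeholders chosen for this example.

```python
import chromadb
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore

# Multilingual embedding model used throughout this repo
embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)

# Persistent ChromaDB collection (path and collection name are placeholders)
chroma_client = chromadb.PersistentClient(path="models/embeddings/chroma_db")
collection = chroma_client.get_or_create_collection("zamai_docs")
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Index plain-text documents from the corpus directory
documents = SimpleDirectoryReader("data/text_corpus").load_data()
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model
)

# Pure semantic retrieval (no LLM required); a query engine, as used in the
# snippet above, additionally needs an LLM configured for answer synthesis
retriever = index.as_retriever(similarity_top_k=3)
for hit in retriever.retrieve("د افغانستان پلازمېنه څه ده؟"):
    print(hit.score, hit.node.get_content()[:100])
```

Because the embeddings and the index are persisted in `chroma_db/`, indexing only needs to be re-run when new documents are added; queries can then be served from the existing collection.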