---
license: apache-2.0
datasets:
- tasal9/Pashto_Dataset
language:
- ps
- en
library_name: sentence-transformers
tags:
- multilingual
- embeddings
- semantic-search
- pashto
- chromadb
- llamaindex
pipeline_tag: feature-extraction
model-index:
- name: Multilingula-ZamAI-Embeddings
  results: []
---

# ZamAI Multilingual Embeddings

This directory contains tools and utilities for working with multilingual embedding models, with a focus on Pashto language support. The embeddings enable semantic search, document retrieval, and other natural language processing tasks across multiple languages.

## Model Information

- **Base Model**: [sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2)
- **Languages Supported**: 50+ including Pashto, English, Arabic, Urdu, Farsi, and more
- **Vector Database**: ChromaDB
- **Integration Framework**: LlamaIndex
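
To make the role of the base model more concrete, here is a minimal sketch of how sentence embeddings could be compared for semantic search. The helper names `cosine_similarity`, `rank_documents`, and `embed` are hypothetical and not part of this repository; `embed` assumes the `sentence-transformers` package is installed and downloads the base model on first use.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, vec) for vec in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


def embed(texts, model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"):
    """Encode texts with the base model (hypothetical helper, not from this repo)."""
    # Imported lazily so the helpers above work without the dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(texts).tolist()
```

With the model available, `embed(["What is the capital of Afghanistan?", "د افغانستان پلازمېنه څه ده؟"])` would return one vector per sentence, and `cosine_similarity` on the pair gives a rough measure of how close the English and Pashto questions are in embedding space.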

## Directory Structure

```
embeddings/
├── setup.py           # Setup script for the embeddings model and vector store
├── demo.py            # Demo application with Gradio web UI
├── indexer.py         # Utility for indexing new documents
├── requirements.txt   # Dependencies for the embeddings components
└── chroma_db/         # Directory for the vector database (created on first run)
```

## Getting Started

1. Install the dependencies:

   ```bash
   pip install -r models/embeddings/requirements.txt
   ```

2. Add documents to index:

   ```bash
   # Place your text files in the data/text_corpus directory
   python models/embeddings/indexer.py --corpus data/text_corpus/
   ```

3. Run the demo application:

   ```bash
   python models/embeddings/demo.py
   ```
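
The internals of `indexer.py` are not shown here, so the following is only a sketch of the kind of preprocessing an indexing step commonly performs before embedding: reading UTF-8 text files from a corpus directory and splitting them into overlapping chunks. The function names `load_corpus` and `chunk_text`, the `.txt` filter, and the default chunk sizes are all assumptions for illustration.

```python
from pathlib import Path


def load_corpus(corpus_dir):
    """Read every .txt file under corpus_dir as UTF-8; return {relative path: text}."""
    root = Path(corpus_dir)
    documents = {}
    for path in sorted(root.rglob("*.txt")):
        documents[path.relative_to(root).as_posix()] = path.read_text(encoding="utf-8")
    return documents


def chunk_text(text, max_chars=500, overlap=50):
    """Split text into chunks of at most max_chars; neighbours share `overlap` characters."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    chunks = []
    step = max_chars - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Reading with an explicit `encoding="utf-8"` matters for Pashto text, since the platform default encoding may not handle Arabic-script characters. The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.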

## Using the Embeddings in Your Code

```python
from models.embeddings.setup import setup_embedding_model

# Initialize the model and related components
embedding_components = setup_embedding_model()

# Get the query engine
query_engine = embedding_components["query_engine"]

# Query in any language
result = query_engine.query("What is the capital of Afghanistan?")
# Or in Pashto
result = query_engine.query("د افغانستان پلازمېنه څه ده؟")

print(result)
```
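
To see which indexed passages a response was built from, something like the sketch below may help. It assumes the query engine is a LlamaIndex query engine whose response object exposes `source_nodes`, each with a `score` and a `node` providing `get_content()`; the helper name `summarize_sources` is hypothetical and not part of this repository.

```python
def summarize_sources(response, max_chars=80):
    """Return (score, snippet) pairs for the passages behind a response.

    Assumes a LlamaIndex-style response; returns an empty list when the
    response carries no source nodes.
    """
    rows = []
    for item in getattr(response, "source_nodes", None) or []:
        # Flatten whitespace so each snippet fits on one line.
        text = item.node.get_content().strip().replace("\n", " ")
        rows.append((item.score, text[:max_chars]))
    return rows
```

Calling `summarize_sources(result)` after the query above would list the retrieved snippets with their similarity scores, which is a quick way to check whether a Pashto query is matching the expected documents.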