license: apache-2.0
datasets:
  - tasal9/Pashto_Dataset
language:
  - ps
  - en
library_name: sentence-transformers
tags:
  - multilingual
  - embeddings
  - semantic-search
  - pashto
  - chromadb
  - llamaindex
pipeline_tag: feature-extraction
model-index:
  - name: Multilingula-ZamAI-Embeddings
    results: []

ZamAI Multilingual Embeddings

This directory contains tools and utilities for working with multilingual embedding models, with a focus on Pashto language support. The embeddings enable semantic search, document retrieval, and other natural language processing tasks across multiple languages.
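As a rough illustration of what semantic search over embeddings involves, the sketch below ranks documents by cosine similarity to a query vector. The three-dimensional vectors and the document ids are made up for the example; a real embedding model would produce much longer vectors, and the vector store in this directory may use a different similarity measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors, invented for illustration only (not real model output).
toy_docs = {
    "kabul_en": [0.9, 0.1, 0.0],
    "kabul_ps": [0.8, 0.2, 0.1],
    "weather": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.0, 0.0]

ranking = rank_documents(toy_query, toy_docs)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

With a multilingual model, an English document and its Pashto counterpart would be expected to land close together in the vector space, which is what makes cross-language retrieval possible.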

Model Information

Directory Structure

embeddings/
├── setup.py # Setup script for the embeddings model and vector store
├── demo.py # Demo application with Gradio web UI
├── indexer.py # Utility for indexing new documents
├── requirements.txt # Dependencies for the embeddings components
└── chroma_db/ # Directory for the vector database (created on first run)

Getting Started

  1. Install the dependencies:

pip install -r models/embeddings/requirements.txt

  2. Add documents to the index. Place your text files in the data/text_corpus directory, then run:

python models/embeddings/indexer.py --corpus data/text_corpus/

  3. Run the demo application:

python models/embeddings/demo.py
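For readers curious what the indexing step might do before embedding, here is a minimal sketch of reading a corpus directory and splitting each file into overlapping chunks. It is not the code in indexer.py: the function names, the `.txt` filter, and the `max_chars` and `overlap` defaults are all assumptions chosen for illustration.

```python
from pathlib import Path


def iter_text_files(corpus_dir):
    """Yield every .txt file under corpus_dir, in sorted order."""
    yield from sorted(Path(corpus_dir).rglob("*.txt"))


def chunk_text(text, max_chars=500, overlap=50):
    """Split text into character windows that overlap by `overlap` characters."""
    if max_chars <= overlap:
        raise ValueError("max_chars must be larger than overlap")
    step = max_chars - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + max_chars].strip()
        if piece:
            chunks.append(piece)
        if start + max_chars >= len(text):
            break  # the window already reached the end of the text
    return chunks


def load_corpus(corpus_dir, max_chars=500, overlap=50):
    """Return (source_path, chunk) pairs for every text file in the corpus."""
    records = []
    for path in iter_text_files(corpus_dir):
        # Read as UTF-8 explicitly so Pashto text decodes the same on every platform.
        text = path.read_text(encoding="utf-8")
        for chunk in chunk_text(text, max_chars, overlap):
            records.append((str(path), chunk))
    return records
```

Calling `load_corpus("data/text_corpus")` would return the chunks that an indexer could then embed and store. Character-based windows are the simplest choice; splitting on sentence or paragraph boundaries usually gives cleaner retrieval results.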

Using the Embeddings in Your Code

from models.embeddings.setup import setup_embedding_model

# Initialize the model and related components
embedding_components = setup_embedding_model()

# Get the query engine
query_engine = embedding_components["query_engine"]

# Query in any language
result = query_engine.query("What is the capital of Afghanistan?")

# Or in Pashto
result = query_engine.query("د افغانستان پلازمېنه څه ده؟")

print(result)
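To compare answers across languages, the same question can be sent in each language and the responses collected side by side. The sketch below assumes only that the query engine exposes a `query(text)` method, as in the usage above; `query_all` and `EchoEngine` are hypothetical names, and `EchoEngine` is a stand-in so the example runs without the real model.

```python
def query_all(query_engine, questions):
    """Send each question to query_engine.query and collect the answers as strings."""
    answers = {}
    for label, question in questions.items():
        answers[label] = str(query_engine.query(question))
    return answers


class EchoEngine:
    """Hypothetical stand-in for the real query engine; it echoes the question back."""

    def query(self, text):
        return f"echo: {text}"


questions = {
    "en": "What is the capital of Afghanistan?",
    "ps": "د افغانستان پلازمېنه څه ده؟",
}

answers = query_all(EchoEngine(), questions)
for label, answer in answers.items():
    print(label, answer)
```

To use it with the real components, pass `embedding_components["query_engine"]` in place of `EchoEngine()`.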