tasal9/Multilingual-ZamAI-Embeddings

	---
	license: apache-2.0
	datasets:
	- tasal9/Pashto_Dataset
	language:
	- ps
	- en
	library_name: sentence-transformers
	tags:
	- multilingual
	- embeddings
	- semantic-search
	- pashto
	- chromadb
	- llamaindex
	pipeline_tag: feature-extraction
	model-index:
	- name: Multilingual-ZamAI-Embeddings
	  results: []
	---
	# ZamAI Multilingual Embeddings

	This directory contains tools and utilities for working with multilingual embedding models, with a focus on Pashto language support. The embeddings enable semantic search, document retrieval, and other natural language processing tasks across multiple languages.

	## Model Information

	- Base Model: [sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2)
	- Languages Supported: 50+, including Pashto, English, Arabic, Urdu, and Farsi
	- Vector Database: ChromaDB
	- Integration Framework: LlamaIndex
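
Semantic search with these embeddings comes down to comparing vectors. The sketch below ranks documents by cosine similarity to a query vector. The function names (`cosine_similarity`, `rank_documents`, `embed`) are illustrative and not part of this repository. `embed` assumes the `sentence-transformers` package is installed and loads the base model listed above; the ranking functions use only the standard library and are shown with toy 3-dimensional vectors in place of real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def embed(texts, model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"):
    """Encode texts with the base model (assumes sentence-transformers is installed)."""
    from sentence_transformers import SentenceTransformer  # imported lazily

    model = SentenceTransformer(model_name)
    return model.encode(texts).tolist()


if __name__ == "__main__":
    # Toy vectors standing in for real embeddings.
    docs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
    query = [1.0, 0.1, 0.0]
    for idx, score in rank_documents(query, docs):
        print(idx, round(score, 3))
```

With real text, `rank_documents(embed([query])[0], embed(documents))` would apply the same ranking to model output. A vector database such as ChromaDB does this lookup at scale, so the pure-Python version is meant only to show the idea.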

	## Directory Structure

	embeddings/
	├── setup.py # Setup script for the embeddings model and vector store
	├── demo.py # Demo application with Gradio web UI
	├── indexer.py # Utility for indexing new documents
	├── requirements.txt # Dependencies for the embeddings components
	└── chroma_db/ # Directory for the vector database (created on first run)

	## Getting Started

	1. Install the dependencies:

	pip install -r models/embeddings/requirements.txt


	2. Add documents to index:

	# Place your text files in the data/text_corpus directory
	python models/embeddings/indexer.py --corpus data/text_corpus/


	3. Run the demo application:

	python models/embeddings/demo.py
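
The internals of `indexer.py` are not shown in this README, so the following is only a guess at the kind of preprocessing an indexing step like step 2 typically needs: reading every `.txt` file under the corpus directory and splitting each one into overlapping word windows before embedding. All names (`load_corpus`, `chunk_text`, `main`) and the chunk sizes are hypothetical; only the `--corpus` flag is taken from the command above.

```python
import argparse
from pathlib import Path


def load_corpus(corpus_dir):
    """Read every .txt file under corpus_dir as UTF-8 text."""
    docs = {}
    for path in sorted(Path(corpus_dir).rglob("*.txt")):
        docs[str(path)] = path.read_text(encoding="utf-8")
    return docs


def chunk_text(text, max_words=200, overlap=20):
    """Split text into word windows that overlap by `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def main(argv=None):
    """Report how many chunks each corpus file would contribute."""
    parser = argparse.ArgumentParser(description="Collect text chunks for indexing")
    parser.add_argument("--corpus", required=True, help="directory of .txt files")
    args = parser.parse_args(argv)
    total = 0
    for name, text in load_corpus(args.corpus).items():
        chunks = chunk_text(text)
        total += len(chunks)
        print(f"{name}: {len(chunks)} chunk(s)")
    return total
```

Calling `main(["--corpus", "data/text_corpus/"])` prints one line per file and returns the total chunk count. Whitespace splitting is a simplification chosen because it works for both Pashto and English text without extra dependencies.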



	## Using the Embeddings in Your Code

	from models.embeddings.setup import setup_embedding_model

	# Initialize the model and related components
	embedding_components = setup_embedding_model()

	# Get the query engine
	query_engine = embedding_components["query_engine"]

	# Query in English
	result_en = query_engine.query("What is the capital of Afghanistan?")
	print(result_en)

	# Or in Pashto
	result_ps = query_engine.query("د افغانستان پلازمېنه څه ده؟")
	print(result_ps)
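
To run the same questions in several languages and compare the answers side by side, a small wrapper can help. The sketch below assumes only that the query engine exposes a `.query(text)` method, as in the snippet above. `run_queries` is a hypothetical helper, and `EchoEngine` is a stand-in used purely so the example runs without the real model or vector store.

```python
def run_queries(query_engine, queries):
    """Send each query to the engine and collect the answers as strings."""
    answers = {}
    for text in queries:
        try:
            answers[text] = str(query_engine.query(text))
        except Exception as exc:  # keep going if one query fails
            answers[text] = f"<error: {exc}>"
    return answers


class EchoEngine:
    """Stand-in for the real query engine; it just echoes the question."""

    def query(self, text):
        return f"echo: {text}"


queries = [
    "What is the capital of Afghanistan?",
    "د افغانستان پلازمېنه څه ده؟",
]
for question, answer in run_queries(EchoEngine(), queries).items():
    print(question, "->", answer)
```

In real use, replace `EchoEngine()` with `embedding_components["query_engine"]`.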