Card Metadata (Optional but Recommended)

You can fill these out directly in the Hugging Face UI or here.

Language: ur (Urdu)

Tasks:

- word-embeddings

Library:

- fasttext

Datasets:

- [Specify your dataset name here, e.g., your-dataset-name-on-hf, or just 'Custom Corpus']

Tags:

- urdu

- word-vectors

- embeddings

- fasttext

- unsupervised

- urdu-nlp

License: [Specify your license here, e.g., mit, apache-2.0, cc-by-4.0]


Urdu Word Embeddings (fastText)

Model Description

This is an unsupervised word embedding model for Urdu, trained with the fastText library. It maps each Urdu word to a dense, fixed-length vector that captures semantic and syntactic relationships inferred from the word's contexts in the training data.

Unlike the original Word2Vec, which learns one vector per whole word, this fastText model was trained with character n-grams (minn=[Your minn], maxn=[Your maxn]). This is particularly beneficial for morphologically rich languages like Urdu, because related word forms share many n-grams. It allows the model to:

  • Learn representations for subword units.
  • Generate meaningful vectors for words it hasn't seen during training (Out-of-Vocabulary or OOV words) by composing vectors from their character n-grams.

The model outputs vectors of dimension [Your vector_size].
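
To make the subword idea concrete, here is a minimal sketch of how character n-grams can be enumerated for a single word. It follows the general fastText scheme of wrapping the word in `<` and `>` boundary markers. The function name `char_ngrams` is hypothetical, and the `minn=3`, `maxn=6` defaults are the library's defaults used as stand-ins, not necessarily the values used for this model. The real library also hashes each n-gram into a fixed number of buckets, which is omitted here.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word`, including boundary markers."""
    wrapped = f"<{word}>"  # "<" and ">" mark the start and end of the word
    grams = []
    for start in range(len(wrapped)):
        for n in range(minn, maxn + 1):
            if start + n > len(wrapped):
                break  # longer n-grams would run past the end of the word
            grams.append(wrapped[start:start + n])
    return grams


# Any string yields n-grams, which is why a vector can be composed
# even for a word that never appeared in the training data.
print(char_ngrams("اردو", minn=3, maxn=4))
```

A word's vector is then built from the vectors of these n-grams (plus the whole-word vector when the word is in the vocabulary).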

Intended Use

This model is intended for use in various Urdu Natural Language Processing (NLP) tasks, including:

  • Measuring semantic similarity between Urdu words.
  • Using word vectors as features for downstream tasks such as text classification, clustering, or named entity recognition.
  • Exploring word relationships and patterns within the vocabulary learned from the training corpus.
  • Obtaining vector representations for potentially unseen words based on their subword components.
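
As one illustration of the downstream-feature use case, word vectors are often averaged into a single fixed-length vector per sentence. The sketch below assumes a hypothetical `lookup` callable that maps a token to its vector (with the real model, `model.get_word_vector` could play this role). The two-dimensional toy vectors are made up for the example.

```python
def average_vector(tokens, lookup, dim):
    """Average the vectors of `tokens`; returns a zero vector for empty input."""
    total = [0.0] * dim
    for token in tokens:
        for i, value in enumerate(lookup(token)):
            total[i] += value
    count = len(tokens)
    return [value / count for value in total] if count else total


# Toy stand-in for a trained model: made-up 2-dimensional vectors.
toy_vectors = {"علم": [1.0, 0.0], "روشنی": [0.0, 1.0]}
features = average_vector(["علم", "روشنی"], toy_vectors.__getitem__, dim=2)
print(features)
```

The fastText Python API also provides `get_sentence_vector`, which performs a similar averaging internally.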

Training Data

This model was trained on a custom text corpus of Urdu sentences.

  • Dataset Source: [Specify the source of your training data here. For example: "Collected from the COUNTER (COrpus of Urdu News TExt Reuse) dataset" or "A custom corpus gathered from [mention sources or domain]"].
  • Data Format: The training data was processed into a single text file (train.txt) where each line represented a sentence or document, and words were separated by spaces.
  • Preprocessing: Basic preprocessing was applied, including replacing common punctuation marks with spaces and normalizing whitespace. [Mention any other specific preprocessing steps you performed, e.g., lowercasing (less common for Urdu), handling numbers, removing specific symbols].
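
The preprocessing described above (punctuation replaced with spaces, whitespace normalized) could look roughly like the sketch below. The exact punctuation set used for this model is not stated, so the `PUNCTUATION` string and the helper name `preprocess_line` are assumptions for illustration.

```python
# Punctuation to replace with spaces. This set is an assumption for illustration.
PUNCTUATION = (
    "\u06D4"  # Arabic full stop, used in Urdu
    "\u060C"  # Arabic comma
    "\u061F"  # Arabic question mark
    "\u061B"  # Arabic semicolon
    ".,;:!?\"'()[]{}"
)
_TO_SPACE = str.maketrans({ch: " " for ch in PUNCTUATION})


def preprocess_line(line):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    return " ".join(line.translate(_TO_SPACE).split())


print(preprocess_line("اردو،  ایک زبان ہے۔"))
```

Applying such a function to every line of the raw corpus yields a space-separated file in the format described for train.txt.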

[If your training data is publicly available or derived from a public source, provide a link or instructions on how others can access it.] [If the data is private, state that the data itself cannot be shared but the resulting model is being released.]

Training Procedure

The model was trained using the unsupervised capabilities of the fastText library.

  • Algorithm: Continuous Bag of Words (CBOW) model (model=cbow). [If you used skipgram, specify that instead and briefly explain why, e.g., "Skip-gram model (model=skipgram), often better for capturing representations of rare words."]

  • Parameters: The following parameters were used during training:

    • dim: [Your vector_size] (Vector dimensionality)
    • ws: [Your window_size] (Context window size)
    • minCount: [Your min_word_count] (Minimum word frequency to be included in vocabulary)
    • epoch: [Your epochs] (Number of training epochs)
    • neg: [Your negative_samples] (Number of negative samples)
    • minn: [Your minn] (Minimum character n-gram length)
    • maxn: [Your maxn] (Maximum character n-gram length)
    • thread: 4 (Number of threads used)
    • [List any other significant parameters you modified]
  • Training Environment: The training was performed in a Google Colab environment.
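
The parameters listed above map directly onto keyword arguments of `fasttext.train_unsupervised`. The sketch below shows the shape of such a call. The numeric values are the library's defaults, used only as stand-ins for the placeholders above. The corpus file name comes from the Training Data section, and the helper name `train_embeddings` and the output file name are hypothetical.

```python
# Stand-in values (fastText defaults), NOT the settings used for this model.
TRAIN_PARAMS = {
    "model": "cbow",  # or "skipgram"
    "dim": 100,       # vector dimensionality
    "ws": 5,          # context window size
    "minCount": 5,    # minimum word frequency
    "epoch": 5,       # number of training epochs
    "neg": 5,         # number of negative samples
    "minn": 3,        # minimum character n-gram length
    "maxn": 6,        # maximum character n-gram length
    "thread": 4,      # number of threads, as listed above
}


def train_embeddings(corpus_path="train.txt", output_path="urdu_fasttext.bin"):
    """Train unsupervised fastText vectors and save the binary model."""
    import fasttext  # imported here so the parameters can be inspected without it

    model = fasttext.train_unsupervised(corpus_path, **TRAIN_PARAMS)
    model.save_model(output_path)
    return model
```

Calling `train_embeddings()` requires the `fasttext` package and a prepared corpus file.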

How to Use

You can load and use this model using the fastText Python library.

First, make sure fastText is installed:

pip install fasttext

Then load the model and query it from Python:

import fasttext
import numpy as np  # For calculating cosine similarity

# Path to the downloaded .bin model file
model_path = "path/to/your/downloaded/urdu_fasttext.bin"

# Load the fastText model
try:
    model = fasttext.load_model(model_path)
    print("Model loaded successfully!")
except ValueError as e:
    print(f"Error loading model: {e}")
    print("Ensure the file exists and is a valid fastText binary model.")
    model = None # Set model to None if loading fails


if model is not None:
    # --- Get Word Vector ---
    word = "پاکستان" # Example Urdu word
    print(f"\nVector for '{word}':")
    try:
        vector = model.get_word_vector(word)
        print(f"Shape: {vector.shape}")
        print(f"First 10 dimensions: {vector[:10]}")
    except ValueError as e:
        print(f"Error getting vector for '{word}': {e}. Word might be too short or have no valid subwords.")


    # --- Find Nearest Neighbors (Similar Words) ---
    word_for_neighbors = "اردو" # Example Urdu word
    print(f"\nWords similar to '{word_for_neighbors}':")
    try:
        # Get the 10 nearest neighbors as a list of (similarity, word) tuples
        neighbors = model.get_nearest_neighbors(word_for_neighbors, k=10)
        if neighbors:
            print(neighbors)
        else:
            print(f"No similar words found for '{word_for_neighbors}'.")
    except ValueError as e:
        print(f"Error finding similar words for '{word_for_neighbors}': {e}. Word might not be valid.")


    # --- Calculate Similarity Between Two Words (Manual Cosine Similarity) ---
    word1 = "علم" # Example word 1
    word2 = "روشنی" # Example word 2
    print(f"\nSimilarity between '{word1}' and '{word2}':")
    try:
        vec1 = model.get_word_vector(word1)
        vec2 = model.get_word_vector(word2)

        # Calculate cosine similarity
        norm1 = np.linalg.norm(vec1)
        norm2 = np.linalg.norm(vec2)

        if norm1 > 0 and norm2 > 0:
            cosine_similarity = np.dot(vec1, vec2) / (norm1 * norm2)
            print(f"Cosine similarity: {cosine_similarity}")
        else:
            print("Cannot compute similarity: zero vector detected for one or both words.")
    except ValueError as e:
        print(f"Error calculating similarity between '{word1}' and '{word2}': {e}. One or both words might not be valid.")

    # --- Using the .vec file (Optional) ---
    # The .vec file contains just the word vectors for words in the vocabulary.
    # It can be loaded by other libraries like Gensim or spaCy.
    # Note: This method *does not* utilize fastText's subword capabilities for OOV words.
    # For fastText specific features, use the .bin file.
    # Example (using gensim - requires gensim installation):
    # from gensim.models import KeyedVectors
    # vec_file_path = "path/to/your/downloaded/urdu_fasttext.vec"
    # try:
    #     # Load vectors in Word2Vec text format
    #     word_vectors = KeyedVectors.load_word2vec_format(vec_file_path, binary=False)
    #     print(f"\nLoaded {len(word_vectors.key_to_index)} vectors from .vec file using Gensim.")
    #     # Example: Find similar words using Gensim
    #     # print(word_vectors.most_similar("اردو"))
    # except Exception as e:
    #     print(f"Error loading .vec file with Gensim: {e}")


else:
    print("\nModel could not be loaded. Usage examples are skipped.")
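
The `.vec` file mentioned in the comments above is a plain-text file in the word2vec text format: a header line with the vocabulary size and vector dimension, followed by one line per word. As a rough sketch of that layout, the hypothetical helpers below write and read such a file using made-up toy vectors. They are illustrative only and not how this model's `.vec` file was produced.

```python
import os
import tempfile


def write_vec(path, vectors):
    """Write a {word: [floats]} mapping in word2vec text format."""
    dim = len(next(iter(vectors.values())))
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"{len(vectors)} {dim}\n")  # header: vocabulary size, dimension
        for word, vec in vectors.items():
            f.write(word + " " + " ".join(f"{x:.4f}" for x in vec) + "\n")


def read_vec(path):
    """Read a word2vec text file back into a {word: [floats]} mapping."""
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the header line
        return {
            parts[0]: [float(x) for x in parts[1:]]
            for parts in (line.split() for line in f)
        }


# Round-trip two made-up vectors through a temporary file.
toy = {"اردو": [0.1, 0.2], "زبان": [0.3, 0.4]}
path = os.path.join(tempfile.mkdtemp(), "toy.vec")
write_vec(path, toy)
loaded = read_vec(path)
print(loaded)
```

Because the file stores only whole-word vectors, anything loaded this way cannot produce vectors for out-of-vocabulary words.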


**Steps after creating the Model Card content:**

1.  **Create a Model Repository on Hugging Face:** Go to huggingface.co, log in, click your profile picture -> "New model".
2.  **Name your Model:** Choose a descriptive name (e.g., `urdu-fasttext-word-embeddings`).
3.  **Set Visibility:** Choose Public or Private.
4.  **Create Model:** This creates an empty repository.
5.  **Upload Files:** Go to the "Files" tab of your new repository. You can either:
    *   Click "Add file" and upload `urdu_fasttext.bin`, `urdu_fasttext.vec`, and your training script file.
    *   Or, clone the repository locally and push the files using Git.
6.  **Edit Model Card:** Go to the "Model card" tab. This is where you paste and format the content prepared above. You can edit it directly in the browser using Markdown.
7.  **Fill in Placeholders:** Go through the content and replace all `[ ... ]` placeholders with your specific details (vector size, epochs, n-gram lengths, dataset source, license, etc.).
8.  **Format with Markdown:** Use the formatting options (headers, bold, code blocks) to make the card readable.
9.  **Save Model Card:** Save the changes.
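
For the upload step, the web UI and Git are the two routes listed above. A third option, sketched below, is the `huggingface_hub` Python library. The repository id is a placeholder to replace with your own, the file names are the ones mentioned in step 5, and the helper name `upload_model_files` is hypothetical. The sketch assumes you have already authenticated, for example with `huggingface-cli login`.

```python
REPO_ID = "your-username/urdu-fasttext-word-embeddings"  # placeholder repository id
FILES = ["urdu_fasttext.bin", "urdu_fasttext.vec"]


def upload_model_files(repo_id=REPO_ID, files=FILES):
    """Create the model repository if needed and upload each file to its root."""
    # Imported lazily; requires `pip install huggingface_hub`.
    from huggingface_hub import HfApi

    api = HfApi()
    api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
    for filename in files:
        api.upload_file(
            path_or_fileobj=filename,  # local path of the file to upload
            path_in_repo=filename,     # destination path inside the repository
            repo_id=repo_id,
            repo_type="model",
        )
```

Run `upload_model_files()` from the directory that contains the model files.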

Your model will then be available on Hugging Face with the documentation you've provided.

Model repository: ReySajju742/urdu-fasttext