metadata

language: en
license: apache-2.0
base_model: Qwen/Qwen2.5-VL-3B-Instruct
tags:
  - vision
  - image-to-text
  - document-understanding
  - content-creators
  - tiktok

FinetunedQWEN Overlay Text Extractor

A specialized vision-language model that extracts overlaid text, such as captions, titles, and promotional text, from images while ignoring text that belongs to the background scene.

Features

Specialized Text Extraction: Focuses on deliberately overlaid text elements
Real-time Processing: Deployed on Hugging Face Inference Endpoints
Simple JSON Interface: Easy to integrate with existing workflows
Lightweight Model: Based on Qwen2.5-VL-3B-Instruct with a fine-tuned adapter

Use Cases

Video caption extraction
Content moderation
Graphic design analysis
Accessibility improvements
Marketing analytics

Technical Details

Base Model: Qwen/Qwen2.5-VL-3B-Instruct
Fine-tuned Adapter: MohammedSameerSyed/FinetunedQWEN
Input: Base64-encoded image
Output: JSON containing the extracted text, or a "{none}" indicator when no overlay text is found
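
As a minimal sketch of handling the output described above, the helper below normalizes a response into either the extracted string or None. It assumes the response is a dict with an `overlay_text` key (the key the Quick Start example reads) and that the "{none}" indicator arrives as that literal string. `parse_overlay_text` and `NONE_INDICATOR` are hypothetical names used for illustration, not part of the endpoint.

```python
from typing import Optional

# Assumed literal form of the "no overlay text" indicator.
NONE_INDICATOR = "{none}"


def parse_overlay_text(result: dict) -> Optional[str]:
    """Return the extracted overlay text, or None when nothing was found."""
    text = result.get("overlay_text")  # assumed response key
    if not isinstance(text, str):
        return None
    text = text.strip()
    # Treat an empty string and the indicator (case-insensitive) as "no text".
    if not text or text.lower() == NONE_INDICATOR:
        return None
    return text
```

Returning None rather than the raw indicator lets calling code use a plain `if text:` check instead of comparing against a magic string.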

Quick Start

Test the model with this simple Python code:

import base64

import requests


def test_model(image_path, endpoint_url):
    # Read the image and encode it as a Base64 string
    with open(image_path, "rb") as f:
        base64_image = base64.b64encode(f.read()).decode("utf-8")

    # json= serializes the payload and sets the Content-Type header
    payload = {"inputs": base64_image}
    response = requests.post(endpoint_url, json=payload, timeout=60)
    response.raise_for_status()  # fail early on HTTP errors
    return response.json()

image_path = "your_image.jpg"
endpoint_url = "YOUR_ENDPOINT_URL"
result = test_model(image_path, endpoint_url)
print(f"Extracted text: {result.get('overlay_text', 'None found')}")

API Usage

Basic request:

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"inputs": "BASE64_ENCODED_IMAGE"}' \
  YOUR_ENDPOINT_URL

Request body with a custom prefix, passed in the parameters field:

{
  "inputs": "BASE64_ENCODED_IMAGE", 
  "parameters": {"prefix": "Extract overlay text: "}
}
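
The same request body can be assembled in Python. The sketch below mirrors the JSON shape shown above and includes the `parameters` field only when a prefix is given. `build_payload` is a hypothetical helper name, and the placeholder bytes stand in for the contents of a real image file.

```python
import base64
import json
from typing import Optional


def build_payload(image_bytes: bytes, prefix: Optional[str] = None) -> str:
    """Build the JSON request body, optionally with a custom prefix."""
    body = {"inputs": base64.b64encode(image_bytes).decode("utf-8")}
    if prefix is not None:
        # Mirrors the "parameters" object from the request body above.
        body["parameters"] = {"prefix": prefix}
    return json.dumps(body)


# Placeholder bytes; real use would pass the bytes read from an image file.
payload = build_payload(b"placeholder image bytes", prefix="Extract overlay text: ")
```

The resulting string can be sent as the request body with the `Content-Type: application/json` header, as in the curl example.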

Limitations

Works best with clear, deliberate text overlays
May struggle with noisy backgrounds or complex overlapping text
Limited support for non-Latin scripts
Performance varies with image quality

Performance Tips

Use high-contrast text for best results
Ensure overlay text is clearly distinguished from background
Avoid highly stylized fonts when possible
Test with your specific image types for optimal results
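
One way to test with your own image types is to label a small set of images by hand and measure how often the extracted text matches. The following is a rough sketch under stated assumptions: `exact_match_rate` and `normalize` are hypothetical helpers, `extract` is any function you supply that maps an image path to the extracted text (or None), and exact match after whitespace and case normalization is only one possible metric.

```python
from typing import Callable, Iterable, Optional, Tuple


def normalize(text: Optional[str]) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join((text or "").lower().split())


def exact_match_rate(
    samples: Iterable[Tuple[str, Optional[str]]],
    extract: Callable[[str], Optional[str]],
) -> float:
    """Fraction of samples whose extracted text matches the expected text."""
    total = 0
    correct = 0
    for image_path, expected in samples:
        total += 1
        if normalize(extract(image_path)) == normalize(expected):
            correct += 1
    # An empty sample set yields 0.0 rather than a division error.
    return correct / total if total else 0.0
```

Here `samples` is a list of (image path, expected overlay text) pairs, with None as the expected value for images that carry no overlay text. Grouping samples by font style or background type and scoring each group separately can show which of the conditions above affect your data most.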

Ethical Considerations

Respect copyright when extracting text from images
Be mindful of privacy when processing images with personal information
Consider bias in text recognition performance across different languages

Contact

Maintainer: Mohammed Sameer Syed
GitHub: https://github.com/SyedMohammedSameer
Repository: MohammedSameerSyed/FinetunedQWEN

Acknowledgements

Qwen Team for the base Qwen2.5-VL-3B-Instruct model
Hugging Face for the infrastructure and tools