davanstrien HF Staff
metadata
title: OCR Time Capsule
emoji: 📦
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false

OCR Time Capsule 📦

A fast, modern web interface for exploring and comparing OCR text improvements in HuggingFace datasets. Browse through pre-processed OCR improvements to see how AI models enhance historical document transcriptions.


Features

  • Fast Navigation: Browse through large OCR datasets with keyboard shortcuts (J/K or arrow keys)
  • Side-by-Side Comparison: View original OCR and improved text simultaneously
  • Advanced Diff Visualization: Character, word, or line-level differences with color highlighting
  • No Backend Required: Direct integration with HuggingFace Dataset Viewer API
  • Responsive Design: Works seamlessly on desktop and mobile devices
  • Dark Mode: Easy on the eyes for extended reading sessions
  • URL Sharing: Share specific dataset samples with direct links
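
To make the word-level diff idea concrete, here is a minimal sketch based on a longest-common-subsequence table. It is an illustration only: the function name diffWords and the operation labels are assumptions, and the app's own diff code may use a different algorithm or a library.

```javascript
// Word-level diff via a longest-common-subsequence (LCS) table.
// Hypothetical helper: not taken from the project's source.
function diffWords(original, improved) {
  const a = original.split(/\s+/).filter(Boolean);
  const b = improved.split(/\s+/).filter(Boolean);

  // lcs[i][j] = length of the LCS of a[i..] and b[j..]
  const lcs = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0)
  );
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] = a[i] === b[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  // Walk the table and emit equal / removed / added operations in order.
  const ops = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      ops.push({ type: 'equal', text: a[i] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      ops.push({ type: 'removed', text: a[i++] });
    } else {
      ops.push({ type: 'added', text: b[j++] });
    }
  }
  while (i < a.length) ops.push({ type: 'removed', text: a[i++] });
  while (j < b.length) ops.push({ type: 'added', text: b[j++] });
  return ops;
}
```

A renderer could then color 'removed' words red and 'added' words green. Character-level or line-level diffs follow the same pattern with a different split step.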

Quick Start

Option 1: Local Development

  1. Clone or download this directory
  2. Serve the files using any static web server:
# Using Python
python -m http.server 8000

# Using Node.js
npx serve .

# Using PHP
php -S localhost:8000
  3. Open http://localhost:8000 in your browser

Option 2: GitHub Pages

  1. Push this directory to a GitHub repository
  2. Enable GitHub Pages in repository settings
  3. Access via https://[username].github.io/[repo-name]/

Option 3: Direct File Access

Simply open index.html in a modern web browser. Note: when the page is loaded from a file:// URL, some features may be limited by the browser's CORS restrictions.

Usage

Loading a Dataset

  1. Enter a HuggingFace dataset ID (e.g., davanstrien/exams-ocr)
  2. Click "Load" or press Enter
  3. The explorer will automatically detect text columns

Navigation

  • Next: Press J or → arrow key
  • Previous: Press K or ← arrow key
  • Switch Views: Press 1 (comparison), 2 (diff), or 3 (improved only)
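
The key bindings above can be sketched as a small lookup function plus a keydown listener. The action names ('next', 'view:diff', and so on) and the function actionForKey are hypothetical; the real handler lives in js/app.js and may be organized differently.

```javascript
// Map a KeyboardEvent.key value to an action name.
// Action names are assumptions made for this sketch.
function actionForKey(key) {
  switch (key) {
    case 'j':
    case 'J':
    case 'ArrowRight':
      return 'next';
    case 'k':
    case 'K':
    case 'ArrowLeft':
      return 'previous';
    case '1':
      return 'view:comparison';
    case '2':
      return 'view:diff';
    case '3':
      return 'view:improved';
    default:
      return null;
  }
}

// Browser wiring; skipped when no DOM is present (for example under Node.js).
if (typeof document !== 'undefined') {
  document.addEventListener('keydown', (event) => {
    // Do not hijack keys while the user is typing a dataset ID.
    if (event.target instanceof HTMLInputElement) return;
    const action = actionForKey(event.key);
    if (action) console.log('dispatch', action);
  });
}
```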

Supported Column Names

The explorer automatically detects these column patterns:

Original OCR: text, ocr, original_text, ground_truth
Improved OCR: markdown, new_ocr, corrected_text, vlm_ocr
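
As a rough illustration of how such detection could work, the sketch below scans a list of column names and picks the first match from each group. It is a stand-alone approximation, not the detectColumns method from js/dataset-api.js; the name improvedTextColumn and the "first match wins" rule are assumptions.

```javascript
// Column name patterns listed in this README.
const ORIGINAL_NAMES = ['text', 'ocr', 'original_text', 'ground_truth'];
const IMPROVED_NAMES = ['markdown', 'new_ocr', 'corrected_text', 'vlm_ocr'];

// Returns the first matching column of each kind, or null when none matches.
// Assumption: the first match in dataset column order wins.
function detectColumns(columnNames) {
  let originalTextColumn = null;
  let improvedTextColumn = null;
  for (const name of columnNames) {
    if (!originalTextColumn && ORIGINAL_NAMES.includes(name)) {
      originalTextColumn = name;
    }
    if (!improvedTextColumn && IMPROVED_NAMES.includes(name)) {
      improvedTextColumn = name;
    }
  }
  return { originalTextColumn, improvedTextColumn };
}
```

For example, a dataset with the columns image, text and markdown would resolve to text as the original OCR and markdown as the improved OCR.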

Technical Details

Architecture

┌─────────────────┐     ┌──────────────────────┐
│   Browser UI    │────▶│ HF Dataset Viewer API│
│  (Alpine.js)    │     │ (datasets-server)    │
└─────────────────┘     └──────────────────────┘
        │
        ▼
┌─────────────────┐
│  Local Cache    │
│  (JavaScript)   │
└─────────────────┘

API Integration

Uses the HuggingFace Dataset Viewer API:

  • Base URL: https://datasets-server.huggingface.co
  • No authentication required for public datasets
  • Automatic handling of image URL expiration
  • Smart batching for efficient data loading
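
A request to the Dataset Viewer API's /rows endpoint might be built as in the sketch below. The helper names buildRowsUrl and fetchRows are hypothetical, and the default config ('default') and split ('train') are assumptions; real datasets may use other names, which the API's /splits endpoint can list.

```javascript
const BASE_URL = 'https://datasets-server.huggingface.co';

// Build a /rows URL. The API caps `length` at 100 rows per request.
function buildRowsUrl({ dataset, config = 'default', split = 'train', offset = 0, length = 100 }) {
  const params = new URLSearchParams({
    dataset,
    config,
    split,
    offset: String(offset),
    length: String(Math.min(length, 100)),
  });
  return `${BASE_URL}/rows?${params}`;
}

// Fetch one page of rows; no auth header is needed for public datasets.
async function fetchRows(options) {
  const response = await fetch(buildRowsUrl(options));
  if (!response.ok) {
    throw new Error(`Dataset Viewer API returned HTTP ${response.status}`);
  }
  return response.json(); // JSON payload that includes `rows` and `features`
}
```

Calling fetchRows({ dataset: 'davanstrien/exams-ocr' }) from the browser console is a quick way to inspect the payload shape for a given dataset.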

Performance Optimizations

  • Batch Loading: Fetches 100 rows at a time
  • Smart Caching: Reduces API calls
  • Lazy Loading: Only loads visible content
  • URL Refresh: Automatically refreshes expired image URLs
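
Batch loading and caching can be pictured as follows: each row index maps to the offset of its 100-row batch, and loaded batches are kept in a Map keyed by that offset. The names batchOffsetFor and BatchCache are hypothetical and the app's actual cache may be structured differently.

```javascript
const BATCH_SIZE = 100; // rows fetched per request, per this README

// Offset of the batch that contains the given row index.
function batchOffsetFor(index, batchSize = BATCH_SIZE) {
  return Math.floor(index / batchSize) * batchSize;
}

// Minimal in-memory cache of batches keyed by their starting offset.
class BatchCache {
  constructor() {
    this.batches = new Map();
  }

  // Remember the rows returned for the batch starting at `offset`.
  store(offset, rows) {
    this.batches.set(offset, rows);
  }

  // Return the cached row, or undefined when its batch is not loaded yet.
  get(index) {
    const offset = batchOffsetFor(index);
    const batch = this.batches.get(offset);
    return batch ? batch[index - offset] : undefined;
  }
}
```

With this layout, moving between neighboring samples only triggers a network request when the user crosses a batch boundary.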

Customization

Adding New Column Patterns

Edit js/dataset-api.js and update the detectColumns method:

if (!originalTextColumn && ['your_column_name'].includes(name)) {
    originalTextColumn = name;
}

Styling

The UI uses Tailwind CSS. Modify styles in:

  • css/styles.css for custom styles
  • Tailwind classes directly in index.html

Keyboard Shortcuts

Add new shortcuts in js/app.js:

case 'your_key':
    // Your action here
    break;

Browser Support

  • Chrome/Edge: Full support
  • Firefox: Full support
  • Safari: Full support (14+)
  • Mobile browsers: Full support with touch navigation

Limitations

  • Maximum 100 rows per API request
  • Image URLs expire after ~1 hour
  • No authentication support for private datasets (yet)
  • Read-only interface (no editing capabilities)

Future Enhancements

  • Export functionality for improved texts
  • Batch processing capabilities
  • Search within dataset
  • Bookmarking system
  • Authentication for private datasets
  • Confidence scores visualization
  • Multi-dataset comparison

Troubleshooting

"Dataset viewer is not available"

  • Check if the dataset exists on HuggingFace
  • Ensure the dataset has viewer enabled
  • Try a known working dataset like davanstrien/exams-ocr
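
To check a dataset by hand, the Dataset Viewer API exposes an /is-valid endpoint that reports which viewer features are available. The sketch below is an assumption about how one might use it; buildIsValidUrl and canBrowseRows are hypothetical helpers and are not part of this project.

```javascript
// Build the /is-valid URL for a dataset ID such as "davanstrien/exams-ocr".
function buildIsValidUrl(dataset) {
  const params = new URLSearchParams({ dataset });
  return `https://datasets-server.huggingface.co/is-valid?${params}`;
}

// Resolve to true when the API reports that rows can be browsed.
// Assumption: either the `viewer` or the `preview` flag is enough for this tool.
async function canBrowseRows(dataset) {
  const response = await fetch(buildIsValidUrl(dataset));
  if (!response.ok) return false;
  const status = await response.json();
  return Boolean(status.viewer || status.preview);
}
```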

Images not loading

  • Image URLs expire after ~1 hour
  • The app automatically refreshes URLs on error
  • Check browser console for detailed errors
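
One way to reason about expiry is to record when a batch was fetched and treat its image URLs as stale after roughly an hour. This is only a sketch of that idea; the helper isLikelyExpired and the exact one-hour threshold are assumptions, and the app itself is described as refreshing URLs when an image fails to load.

```javascript
// "~1 hour" as stated in this README; the exact lifetime is an assumption.
const IMAGE_URL_TTL_MS = 60 * 60 * 1000;

// True when a URL fetched at `fetchedAtMs` is probably no longer valid.
function isLikelyExpired(fetchedAtMs, nowMs = Date.now(), ttlMs = IMAGE_URL_TTL_MS) {
  return nowMs - fetchedAtMs >= ttlMs;
}
```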

Slow loading

  • Large datasets may take time for initial load
  • Consider using datasets with pre-computed statistics
  • Check your internet connection

Contributing

This is a standalone tool designed for OCR exploration. Feel free to fork and customize for your needs!

License

MIT License - Use freely for any purpose

Related Projects