CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with the OCR Text Explorer.

Project Overview

OCR Text Explorer is a modern, standalone web application for browsing and comparing OCR text improvements in HuggingFace datasets. Built as a lightweight alternative to the Gradio-based OCR Time Machine, it focuses specifically on exploring pre-OCR'd datasets with an enhanced user experience.

Architecture

Technology Stack

  • Frontend Framework: Alpine.js (lightweight reactivity, ~15KB)
  • Styling: Tailwind CSS (utility-first, responsive design)
  • Interactions: HTMX (server-side rendering capabilities)
  • API: HuggingFace Dataset Viewer API (no backend required)
  • Language: Vanilla JavaScript (no build process needed)

Core Components

index.html - Main application shell

  • Split-pane layout (1/3 image, 2/3 text comparison)
  • Three view modes: Side-by-side, Inline diff, Improved only
  • Dark mode support with proper contrast
  • Responsive design for mobile devices

js/dataset-api.js - HuggingFace API wrapper

  • Smart caching with 45-minute expiration for signed URLs
  • Batch loading (100 rows at a time)
  • Automatic column detection for different dataset schemas
  • Image URL refresh on expiration
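The 45-minute expiration described above could be implemented along the following lines. This is a minimal sketch, not the code in dataset-api.js: the `TtlCache` class, its method names, and the injectable clock are all hypothetical and chosen for illustration.

```javascript
// Hypothetical TTL cache; names are illustrative, not taken from dataset-api.js.
const CACHE_TTL_MS = 45 * 60 * 1000; // 45 minutes, per the bullet above

class TtlCache {
  constructor(ttlMs = CACHE_TTL_MS, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, handy for testing
    this.entries = new Map();
  }

  set(key, value) {
    this.entries.set(key, { value, storedAt: this.now() });
  }

  // Returns the cached value, or undefined if it is missing or expired.
  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() - entry.storedAt >= this.ttlMs) {
      this.entries.delete(key); // drop the stale entry so the caller refetches
      return undefined;
    }
    return entry.value;
  }
}
```

Expiring entries slightly before the signed URLs themselves expire means a cached row should never hand the UI an image URL that is already dead.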

js/app.js - Alpine.js application logic

  • Keyboard navigation (J/K, arrows)
  • URL state management for shareable links
  • Diff mode switching (character/word/line)
  • Dark mode persistence in localStorage
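As an illustration of URL state management for shareable links, the sketch below serializes a small state object to a query string and reads it back. The state shape and the parameter names (`dataset`, `index`, `view`) are assumptions made for this example and may not match app.js.

```javascript
// Hypothetical state shape: { dataset, index, view }. Parameter names are assumptions.
function stateToQuery(state) {
  const params = new URLSearchParams();
  params.set('dataset', state.dataset);
  params.set('index', String(state.index));
  params.set('view', state.view);
  return params.toString();
}

// Parse a query string back into state, falling back to defaults for
// missing or invalid values.
function queryToState(query, defaults) {
  const params = new URLSearchParams(query);
  const index = Number.parseInt(params.get('index') ?? '', 10);
  return {
    dataset: params.get('dataset') || defaults.dataset,
    index: Number.isInteger(index) && index >= 0 ? index : defaults.index,
    view: params.get('view') || defaults.view,
  };
}
```

In the browser, the resulting string could be written with `history.replaceState(null, '', '?' + stateToQuery(state))` so that navigating between samples does not flood the history stack.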

js/diff-utils.js - Text comparison algorithms

  • Character-level diff with inline highlighting
  • Word-level diff preserving whitespace
  • Line-level diff for larger changes
  • LCS (Longest Common Subsequence) implementation
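To make the LCS approach concrete, here is a minimal sketch of a token-level diff built on a dynamic-programming LCS table. It is not the code in diff-utils.js; the function names and the `{ type, value }` operation format are hypothetical. Splitting the text into alternating word and whitespace tokens is one way to preserve whitespace in a word-level diff.

```javascript
// Tokenize into words and whitespace runs so whitespace survives the diff.
function tokenizeWords(text) {
  return text.match(/\s+|\S+/g) || [];
}

// table[i][j] holds the LCS length of a.slice(i) and b.slice(j).
function lcsTable(a, b) {
  const table = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0));
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      table[i][j] = a[i] === b[j]
        ? table[i + 1][j + 1] + 1
        : Math.max(table[i + 1][j], table[i][j + 1]);
    }
  }
  return table;
}

// Walk the table and emit 'equal', 'delete', and 'insert' operations.
function diffTokens(a, b) {
  const table = lcsTable(a, b);
  const ops = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      ops.push({ type: 'equal', value: a[i] });
      i++;
      j++;
    } else if (table[i + 1][j] >= table[i][j + 1]) {
      ops.push({ type: 'delete', value: a[i] });
      i++;
    } else {
      ops.push({ type: 'insert', value: b[j] });
      j++;
    }
  }
  while (i < a.length) ops.push({ type: 'delete', value: a[i++] });
  while (j < b.length) ops.push({ type: 'insert', value: b[j++] });
  return ops;
}
```

The same `diffTokens` function can serve character-level and line-level modes by passing `[...text]` or `text.split('\n')` instead of word tokens. The table uses memory proportional to the product of the two token counts, so coarser tokens are cheaper for long texts.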

css/styles.css - Custom styling

  • Dark mode enhancements
  • Diff highlighting with accessibility in mind
  • Smooth transitions and animations
  • Print-friendly styles

Key Design Decisions

Why Separate from OCR Time Machine?

  1. Focused Purpose: OCR Time Machine is for live OCR processing with VLMs (requires GPU), while this explorer is for browsing pre-processed results
  2. Performance: No Python/Gradio overhead - instant loading and navigation
  3. User Experience: Custom UI optimized for text comparison workflows
  4. Deployment: Static files can be hosted anywhere (GitHub Pages, CDN, etc.)

API vs Backend Trade-offs

Chose HF Dataset Viewer API because:

  • No backend infrastructure needed
  • Automatic image serving with CDN
  • Built-in pagination support
  • Works with any public HF dataset

Limitations accepted:

  • Image URLs expire (~1 hour)
  • 100 rows max per request
  • No write capabilities
  • Public datasets only (no auth yet)
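The pagination support and the 100-row limit can be illustrated with a small URL builder for the Dataset Viewer `/rows` endpoint. This is a sketch under stated assumptions: the helper name `buildRowsUrl` and the default `config` and `split` values are hypothetical, and a real dataset may use different config or split names.

```javascript
const ROWS_ENDPOINT = 'https://datasets-server.huggingface.co/rows';
const MAX_ROWS_PER_REQUEST = 100; // limit noted in the list above

// Build a /rows request URL, clamping length to the per-request limit.
// The 'default' config and 'train' split are assumed defaults for illustration.
function buildRowsUrl({
  dataset,
  config = 'default',
  split = 'train',
  offset = 0,
  length = MAX_ROWS_PER_REQUEST,
}) {
  const params = new URLSearchParams({
    dataset,
    config,
    split,
    offset: String(Math.max(0, offset)),
    length: String(Math.min(Math.max(1, length), MAX_ROWS_PER_REQUEST)),
  });
  return `${ROWS_ENDPOINT}?${params}`;
}
```

Clamping `length` on the client keeps an oversized request from being sent in the first place, and `URLSearchParams` handles the encoding of the `/` in dataset names.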

UI/UX Principles

  1. Keyboard-first: Professional users prefer keyboard navigation
  2. Information density: Show more content, less chrome
  3. Visual diff: Color-coded changes are easier to scan than a plain side-by-side view
  4. Dark mode: Essential for extended reading sessions
  5. Responsive: Works on tablets for field work

Development Approach

Phase 1: MVP (Completed)

  • Basic dataset loading and navigation
  • Side-by-side text comparison
  • Keyboard shortcuts
  • Dark mode

Phase 2: Enhancements (Completed)

  • Three diff algorithms (char/word/line)
  • URL state management
  • Image error handling with refresh
  • Responsive mobile layout

Phase 3: Polish (Completed)

  • Fixed dark mode contrast issues
  • Optimized performance with direct indexing
  • Added loading states and error handling
  • Comprehensive documentation

Common Tasks

Adding Column Name Patterns

// In dataset-api.js detectColumns() method
if (!originalTextColumn && ['your_column_name'].includes(name)) {
    originalTextColumn = name;
}

Adding Keyboard Shortcuts

// In app.js setupKeyboardNavigation()
case 'your_key':
    // Your action
    break;
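For context, key handling can also be expressed as a lookup table instead of a `switch`. The sketch below is hypothetical and not the contents of `setupKeyboardNavigation()`: treating J and the right arrow as "next" and K and the left arrow as "previous" follows the common Vim-style convention and is an assumption, as are the action names.

```javascript
// Hypothetical key-to-action table; the direction of J/K and the choice of
// left/right arrows are assumptions, not confirmed by app.js.
const KEY_ACTIONS = new Map([
  ['j', 'next'],
  ['ArrowRight', 'next'],
  ['k', 'previous'],
  ['ArrowLeft', 'previous'],
]);

// Returns the action name for a keydown event, or null if it should be ignored.
function actionForKey(event) {
  // Do not hijack keys while the user is typing in a form field.
  const tag = event.target && event.target.tagName;
  if (tag === 'INPUT' || tag === 'TEXTAREA') return null;
  // Leave browser and OS shortcuts (Ctrl+K, Cmd+J, ...) alone.
  if (event.ctrlKey || event.metaKey || event.altKey) return null;
  // Normalize single letters so Shift or Caps Lock does not matter.
  const key = event.key.length === 1 ? event.key.toLowerCase() : event.key;
  return KEY_ACTIONS.get(key) ?? null;
}
```

Adding a shortcut then becomes a one-line change to the table, and the function can be tested with plain objects in place of real `KeyboardEvent` instances.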

Customizing Diff Colors

// In diff-utils.js
// Light mode: bg-red-200, text-red-800
// Dark mode: bg-red-950, text-red-300

Performance Optimizations

  1. Direct Dataset Indexing: Uses dataset[index] instead of loading batches into memory
  2. Smart Caching: Caches API responses for 45 minutes (conservative for signed URLs)
  3. Batch Fetching: Loads 100 rows at once, caches for smooth navigation
  4. Lazy Loading: Only fetches data when needed
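The interplay between batch fetching and lazy loading can be sketched as follows: map a row index to the 100-row batch that contains it, and only fetch when that batch is not already held. The `locateRow` helper and `BatchStore` class are hypothetical names used for illustration, not the structures used in dataset-api.js.

```javascript
const BATCH_SIZE = 100; // matches the 100-row batches described above

// Map an absolute row index to its batch offset and position within the batch.
function locateRow(index, batchSize = BATCH_SIZE) {
  const offset = Math.floor(index / batchSize) * batchSize;
  return { offset, position: index - offset };
}

// Hypothetical in-memory store keyed by batch offset.
class BatchStore {
  constructor() {
    this.batches = new Map();
  }

  put(offset, rows) {
    this.batches.set(offset, rows);
  }

  // Returns the row if its batch is cached; undefined means the caller
  // should fetch the batch starting at locateRow(index).offset.
  get(index) {
    const { offset, position } = locateRow(index);
    const rows = this.batches.get(offset);
    return rows ? rows[position] : undefined;
  }
}
```

With this layout, stepping through rows 0 to 99 would trigger a single request, and the next request would only happen when navigation crosses into row 100.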

Known Issues & Solutions

Issue: Navigation buttons were disabled

Cause: API response structure wasn't parsed correctly

Fix: Updated getTotalRows() to check size.config.num_rows and size.splits[0].num_rows
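The fallback order described in this fix might look like the sketch below. It is an illustration, not the actual `getTotalRows()` implementation: the function name `parseTotalRows` is hypothetical, and the response shape is inferred only from the two property paths named above.

```javascript
// Hypothetical parser for a size response; checks size.config.num_rows first,
// then falls back to size.splits[0].num_rows, as described in the fix above.
function parseTotalRows(response) {
  const size = response && response.size;
  if (!size) return null;
  if (size.config && Number.isInteger(size.config.num_rows)) {
    return size.config.num_rows;
  }
  if (
    Array.isArray(size.splits) &&
    size.splits.length > 0 &&
    Number.isInteger(size.splits[0].num_rows)
  ) {
    return size.splits[0].num_rows;
  }
  return null; // unknown total: callers can keep navigation disabled
}
```

Returning `null` instead of `0` for an unrecognized shape lets the UI distinguish "empty dataset" from "row count unavailable".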

Issue: Dark mode text unreadable

Cause: Insufficient contrast in diff highlighting and code blocks

Fix:

  • Changed diff colors to use dark:bg-red-950 and dark:text-red-300
  • Added explicit text-gray-900 dark:text-gray-100 to all text containers

Issue: Image loading errors

Cause: Signed URLs expire after ~1 hour

Fix: Implemented handleImageError() with automatic URL refresh

Future Enhancements

  • Search/filter within dataset
  • Bookmark favorite samples
  • Export selected texts
  • Support for private datasets (auth)
  • Metrics display (CER/WER)
  • Batch operations
  • PWA support for offline viewing

Deployment

Static Hosting (Recommended)

# Any static file server works; run either one of these from the project root
python3 -m http.server 8000
npx serve .

GitHub Pages

  1. Push to GitHub repository
  2. Enable Pages in settings
  3. Access at: https://[username].github.io/[repo]/ocr-text-explorer/

CDN Deployment

  • Upload files to any CDN
  • No server-side processing needed
  • Works with CloudFlare, Netlify, Vercel, etc.

Testing Datasets

Known working datasets:

  • davanstrien/exams-ocr - Default dataset with great examples
  • Any dataset with image + text columns

Column patterns automatically detected:

  • Original: text, ocr, original_text, ground_truth
  • Improved: markdown, new_ocr, corrected_text, vlm_ocr
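The pattern lists above suggest a simple first-match detection pass. The following sketch assumes exact name matching, in line with the `includes(name)` check shown under "Adding Column Name Patterns". The function name `detectTextColumns` and its return shape are hypothetical.

```javascript
// Pattern lists copied from the section above.
const ORIGINAL_PATTERNS = ['text', 'ocr', 'original_text', 'ground_truth'];
const IMPROVED_PATTERNS = ['markdown', 'new_ocr', 'corrected_text', 'vlm_ocr'];

// Pick the first column matching each pattern list (exact match is an assumption).
function detectTextColumns(columnNames) {
  let originalTextColumn = null;
  let improvedTextColumn = null;
  for (const name of columnNames) {
    if (!originalTextColumn && ORIGINAL_PATTERNS.includes(name)) {
      originalTextColumn = name;
    } else if (!improvedTextColumn && IMPROVED_PATTERNS.includes(name)) {
      improvedTextColumn = name;
    }
  }
  return { originalTextColumn, improvedTextColumn };
}
```

For a dataset with columns `['image', 'text', 'markdown']`, this would select `text` as the original column and `markdown` as the improved one. Columns matching neither list are ignored.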