CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with the OCR Text Explorer.
Project Overview
OCR Text Explorer is a modern, standalone web application for browsing and comparing OCR text improvements in HuggingFace datasets. Built as a lightweight alternative to the Gradio-based OCR Time Machine, it focuses specifically on exploring pre-OCR'd datasets with enhanced user experience.
Architecture
Technology Stack
- Frontend Framework: Alpine.js (lightweight reactivity, ~15KB)
- Styling: Tailwind CSS (utility-first, responsive design)
- Interactions: HTMX (server-side rendering capabilities)
- API: HuggingFace Dataset Viewer API (no backend required)
- Language: Vanilla JavaScript (no build process needed)
Core Components
index.html - Main application shell
- Split-pane layout (1/3 image, 2/3 text comparison)
- Three view modes: Side-by-side, Inline diff, Improved only
- Dark mode support with proper contrast
- Responsive design for mobile devices
js/dataset-api.js - HuggingFace API wrapper
- Smart caching with 45-minute expiration for signed URLs
- Batch loading (100 rows at a time)
- Automatic column detection for different dataset schemas
- Image URL refresh on expiration
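The 45-minute expiration can be pictured as a small time-to-live cache. This is a minimal sketch rather than the code in dataset-api.js: the class name TtlCache, its method names, and the injectable clock are assumptions made for illustration.

```javascript
// Hypothetical TTL cache; names are illustrative, not taken from dataset-api.js.
const CACHE_TTL_MS = 45 * 60 * 1000; // 45 minutes, safely below the ~1 hour URL lifetime

class TtlCache {
  constructor(ttlMs = CACHE_TTL_MS, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, which makes expiry easy to test
    this.entries = new Map();
  }

  set(key, value) {
    this.entries.set(key, { value, storedAt: this.now() });
  }

  // Returns the cached value, or undefined when the key is missing or expired.
  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() - entry.storedAt >= this.ttlMs) {
      this.entries.delete(key); // drop stale entries so signed URLs are re-fetched
      return undefined;
    }
    return entry.value;
  }
}
```

Expiring entries before the signed URLs themselves expire means a cache hit should never hand back a dead image link.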
js/app.js - Alpine.js application logic
- Keyboard navigation (J/K, arrows)
- URL state management for shareable links
- Diff mode switching (character/word/line)
- Dark mode persistence in localStorage
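URL state management for shareable links can be sketched with URLSearchParams. The query parameter names used here (dataset, index, view, diff) and both function names are assumptions, not necessarily what app.js uses.

```javascript
// Hypothetical helpers; parameter names are assumptions for illustration.
function encodeState(state) {
  const params = new URLSearchParams();
  params.set('dataset', state.dataset);
  params.set('index', String(state.index));
  params.set('view', state.view);
  params.set('diff', state.diff);
  return params.toString();
}

// Parses a query string, falling back to defaults for missing or invalid values.
function decodeState(query, defaults) {
  const params = new URLSearchParams(query);
  const index = Number.parseInt(params.get('index') ?? '', 10);
  return {
    dataset: params.get('dataset') ?? defaults.dataset,
    index: Number.isInteger(index) && index >= 0 ? index : defaults.index,
    view: params.get('view') ?? defaults.view,
    diff: params.get('diff') ?? defaults.diff,
  };
}
```

In the browser, the encoded string could be written with history.replaceState(null, '', '?' + encodeState(state)) and read back on load with decodeState(location.search, defaults).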
js/diff-utils.js - Text comparison algorithms
- Character-level diff with inline highlighting
- Word-level diff preserving whitespace
- Line-level diff for larger changes
- LCS (Longest Common Subsequence) implementation
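To illustrate how an LCS table can drive a diff, here is a minimal sketch that works on any token sequence, so the same routine could serve character, word, or line mode depending on how the text is split. The function names lcsDiff and tokenizeWords and the shape of the returned operations are assumptions, not the code in diff-utils.js.

```javascript
// Illustrative LCS-based diff over arbitrary tokens; not the code in diff-utils.js.
function lcsDiff(a, b) {
  const m = a.length;
  const n = b.length;
  // table[i][j] holds the LCS length of the suffixes a[i..] and b[j..].
  const table = Array.from({ length: m + 1 }, () => new Array(n + 1).fill(0));
  for (let i = m - 1; i >= 0; i--) {
    for (let j = n - 1; j >= 0; j--) {
      table[i][j] = a[i] === b[j]
        ? table[i + 1][j + 1] + 1
        : Math.max(table[i + 1][j], table[i][j + 1]);
    }
  }
  // Walk the table from the start, emitting equal/delete/insert operations.
  const ops = [];
  let i = 0;
  let j = 0;
  while (i < m && j < n) {
    if (a[i] === b[j]) {
      ops.push({ type: 'equal', value: a[i] });
      i++;
      j++;
    } else if (table[i + 1][j] >= table[i][j + 1]) {
      ops.push({ type: 'delete', value: a[i++] });
    } else {
      ops.push({ type: 'insert', value: b[j++] });
    }
  }
  while (i < m) ops.push({ type: 'delete', value: a[i++] });
  while (j < n) ops.push({ type: 'insert', value: b[j++] });
  return ops;
}

// Word-level tokens; whitespace runs are kept as their own tokens so spacing is preserved.
const tokenizeWords = (text) => text.split(/(\s+)/).filter((token) => token !== '');
```

The table costs O(m × n) time and memory, which is one reason a coarser line-level mode is attractive for long pages.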
css/styles.css - Custom styling
- Dark mode enhancements
- Diff highlighting with accessibility in mind
- Smooth transitions and animations
- Print-friendly styles
Key Design Decisions
Why Separate from OCR Time Machine?
- Focused Purpose: OCR Time Machine is for live OCR processing with VLMs (requires GPU), while this explorer is for browsing pre-processed results
- Performance: No Python/Gradio overhead - instant loading and navigation
- User Experience: Custom UI optimized for text comparison workflows
- Deployment: Static files can be hosted anywhere (GitHub Pages, CDN, etc.)
API vs Backend Trade-offs
Chose HF Dataset Viewer API because:
- No backend infrastructure needed
- Automatic image serving with CDN
- Built-in pagination support
- Works with any public HF dataset
Limitations accepted:
- Image URLs expire (~1 hour)
- 100 rows max per request
- No write capabilities
- Public datasets only (no auth yet)
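The 100-row limit shapes how requests are built. The sketch below constructs a URL for the Dataset Viewer /rows endpoint and clamps the requested length; the function name buildRowsUrl and the default config and split values are assumptions, since real datasets may use other names.

```javascript
// Sketch of a /rows request URL; 'default' and 'train' are assumed defaults, not guaranteed names.
const API_BASE = 'https://datasets-server.huggingface.co';
const MAX_ROWS_PER_REQUEST = 100;

function buildRowsUrl({ dataset, config = 'default', split = 'train', offset = 0, length = MAX_ROWS_PER_REQUEST }) {
  const params = new URLSearchParams({
    dataset,
    config,
    split,
    offset: String(Math.max(0, offset)),
    // Clamp to the 1..100 range so oversized requests are never sent.
    length: String(Math.min(Math.max(1, length), MAX_ROWS_PER_REQUEST)),
  });
  return `${API_BASE}/rows?${params}`;
}
```

The resulting URL can be passed straight to fetch(); because only public datasets are supported, no authorization header is involved.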
UI/UX Principles
- Keyboard-first: Professional users prefer keyboard navigation
- Information density: Show more content, less chrome
- Visual diff: Color-coded changes are easier to scan than side-by-side
- Dark mode: Essential for extended reading sessions
- Responsive: Works on tablets for field work
Development Approach
Phase 1: MVP (Completed)
- Basic dataset loading and navigation
- Side-by-side text comparison
- Keyboard shortcuts
- Dark mode
Phase 2: Enhancements (Completed)
- Three diff algorithms (char/word/line)
- URL state management
- Image error handling with refresh
- Responsive mobile layout
Phase 3: Polish (Completed)
- Fixed dark mode contrast issues
- Optimized performance with direct indexing
- Added loading states and error handling
- Comprehensive documentation
Common Tasks
Adding Column Name Patterns
// In dataset-api.js detectColumns() method
if (!originalTextColumn && ['your_column_name'].includes(name)) {
  originalTextColumn = name;
}
Adding Keyboard Shortcuts
// In app.js setupKeyboardNavigation()
case 'your_key':
  // Your action
  break;
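For context, a key handler of this kind can be reduced to a lookup that maps a keyboard event to an action name. This is a sketch under assumptions: the action names, the direction assigned to each key, and the rule of ignoring keystrokes in form fields are illustrative and may differ from app.js.

```javascript
// Hypothetical key-to-action table; J/K and arrows come from the feature list, directions are assumed.
const KEY_ACTIONS = new Map([
  ['j', 'next'],
  ['ArrowRight', 'next'],
  ['k', 'previous'],
  ['ArrowLeft', 'previous'],
]);

function actionForKey(event) {
  // Ignore keystrokes typed into form fields or combined with modifier keys.
  const tag = event.target?.tagName;
  if (tag === 'INPUT' || tag === 'TEXTAREA') return null;
  if (event.ctrlKey || event.metaKey || event.altKey) return null;
  // Normalize single letters so "J" and "j" behave the same.
  const key = event.key.length === 1 ? event.key.toLowerCase() : event.key;
  return KEY_ACTIONS.get(key) ?? null;
}
```

A listener such as document.addEventListener('keydown', (e) => { const action = actionForKey(e); ... }) could then dispatch on the returned name, and adding a shortcut becomes a one-line change to the table.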
Customizing Diff Colors
// In diff-utils.js
// Light mode: bg-red-200, text-red-800
// Dark mode: bg-red-950, text-red-300
Performance Optimizations
- Direct Dataset Indexing: Uses dataset[index] instead of loading batches into memory
- Smart Caching: Caches API responses for 45 minutes (conservative for signed URLs)
- Batch Fetching: Loads 100 rows at once, caches for smooth navigation
- Lazy Loading: Only fetches data when needed
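Batch fetching and lazy loading fit together when a row index is first mapped to the batch that contains it, so a request is only made if that batch is not already cached. A minimal sketch, with the function name locateRow as an assumption:

```javascript
// Hypothetical helper: maps a row index to its 100-row batch.
const BATCH_SIZE = 100;

function locateRow(index, batchSize = BATCH_SIZE) {
  const offset = Math.floor(index / batchSize) * batchSize; // first row of the batch
  return { offset, position: index - offset };              // position within the batch
}
```

The offset doubles as a natural cache key: moving between rows 200 and 299 always resolves to the batch at offset 200, so navigation within a batch needs no further requests.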
Known Issues & Solutions
Issue: Navigation buttons were disabled
Cause: API response structure wasn't parsed correctly
Fix: Updated getTotalRows() to check size.config.num_rows and size.splits[0].num_rows
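The fallback described in this fix could look roughly like the following. It is a sketch under assumptions, not the actual getTotalRows(): the name parseTotalRows and the choice to return null when neither field is present are illustrative.

```javascript
// Sketch of the row-count lookup; returning null on failure is an assumption.
function parseTotalRows(response) {
  const size = response?.size;
  // Prefer the config-level count when the response includes it.
  const fromConfig = size?.config?.num_rows;
  if (Number.isInteger(fromConfig)) return fromConfig;
  // Otherwise fall back to the first split's count.
  const fromSplit = size?.splits?.[0]?.num_rows;
  if (Number.isInteger(fromSplit)) return fromSplit;
  return null; // unknown total; callers can keep navigation disabled
}
```

Optional chaining keeps the lookup from throwing when the response shape differs from what is expected.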
Issue: Dark mode text unreadable
Cause: Insufficient contrast in diff highlighting and code blocks
Fix:
- Changed diff colors to use dark:bg-red-950 and dark:text-red-300
- Added explicit text-gray-900 dark:text-gray-100 to all text containers
Issue: Image loading errors
Cause: Signed URLs expire after ~1 hour
Fix: Implemented handleImageError() with automatic URL refresh
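One way such a refresh could work is sketched below. This is not the actual handleImageError(): the function name recoverImage, the caller-supplied refreshUrl callback, and the retry counter stored in a data attribute are all assumptions.

```javascript
// Hypothetical recovery routine; refreshUrl is a caller-supplied async function
// that re-requests the row and resolves to a fresh signed URL.
async function recoverImage(img, refreshUrl, maxRetries = 1) {
  const attempts = Number(img.dataset.retries ?? 0);
  if (attempts >= maxRetries) return false; // stop, to avoid an endless error loop
  img.dataset.retries = String(attempts + 1);
  img.src = await refreshUrl();             // assigning src triggers a reload
  return true;
}
```

It could be wired up with img.addEventListener('error', () => recoverImage(img, fetchFreshUrl)), where fetchFreshUrl is likewise hypothetical. Clearing the counter in a load handler would let a later expiry trigger another refresh.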
Future Enhancements
- Search/filter within dataset
- Bookmark favorite samples
- Export selected texts
- Support for private datasets (auth)
- Metrics display (CER/WER)
- Batch operations
- PWA support for offline viewing
Deployment
Static Hosting (Recommended)
# Any static file server works
python3 -m http.server 8000
npx serve .
GitHub Pages
- Push to GitHub repository
- Enable Pages in settings
- Access at: https://[username].github.io/[repo]/ocr-text-explorer/
CDN Deployment
- Upload files to any CDN
- No server-side processing needed
- Works with CloudFlare, Netlify, Vercel, etc.
Testing Datasets
Known working datasets:
- davanstrien/exams-ocr: Default dataset with great examples
- Any dataset with image + text columns
Column patterns automatically detected:
- Original: text, ocr, original_text, ground_truth
- Improved: markdown, new_ocr, corrected_text, vlm_ocr
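
The pattern lists above suggest a simple lookup. The following is a minimal sketch, not the actual detectColumns(): the function name detectTextColumns, exact-name matching, and treating list order as priority order are assumptions.

```javascript
// Pattern lists copied from this document; matching rules are assumed.
const ORIGINAL_PATTERNS = ['text', 'ocr', 'original_text', 'ground_truth'];
const IMPROVED_PATTERNS = ['markdown', 'new_ocr', 'corrected_text', 'vlm_ocr'];

function detectTextColumns(columnNames) {
  // Return the first pattern that appears among the dataset's column names, or null.
  const firstMatch = (patterns) => patterns.find((name) => columnNames.includes(name)) ?? null;
  return {
    original: firstMatch(ORIGINAL_PATTERNS),
    improved: firstMatch(IMPROVED_PATTERNS),
  };
}
```

A null result for either column is a signal that the dataset's schema is not covered and a new pattern needs to be added.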