title: OCR Time Capsule
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
OCR Time Capsule
A fast, modern web interface for exploring and comparing OCR text improvements in HuggingFace datasets. Browse through pre-processed OCR improvements to see how AI models enhance historical document transcriptions.
Features
- Fast Navigation: Browse through large OCR datasets with keyboard shortcuts (J/K or arrow keys)
- Side-by-Side Comparison: View original OCR and improved text simultaneously
- Advanced Diff Visualization: Character, word, or line-level differences with color highlighting
- No Backend Required: Direct integration with HuggingFace Dataset Viewer API
- Responsive Design: Works seamlessly on desktop and mobile devices
- Dark Mode: Easy on the eyes for extended reading sessions
- URL Sharing: Share specific dataset samples with direct links
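To illustrate what a word-level diff involves, here is a minimal sketch based on a longest-common-subsequence (LCS) table. It is not taken from the app's source; the function names `tokenize` and `wordDiff` are hypothetical, and the app may well use a different algorithm or a diff library.

```javascript
// Sketch of a word-level diff using an LCS table.
// Illustrative only; not the app's actual implementation.

function tokenize(text) {
  // Split on whitespace and drop empty tokens.
  return text.split(/\s+/).filter((token) => token.length > 0);
}

function wordDiff(original, improved) {
  const a = tokenize(original);
  const b = tokenize(improved);

  // lcs[i][j] = length of the LCS of a[i..] and b[j..].
  const lcs = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0)
  );
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] =
        a[i] === b[j]
          ? lcs[i + 1][j + 1] + 1
          : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  // Walk the table to emit equal / removed / added operations.
  const ops = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      ops.push({ type: "equal", text: a[i] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      ops.push({ type: "removed", text: a[i] });
      i++;
    } else {
      ops.push({ type: "added", text: b[j] });
      j++;
    }
  }
  while (i < a.length) ops.push({ type: "removed", text: a[i++] });
  while (j < b.length) ops.push({ type: "added", text: b[j++] });
  return ops;
}
```

A UI could render `removed` tokens in one color and `added` tokens in another. The same approach works at character or line level by changing only `tokenize`.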
Quick Start
Option 1: Local Development
- Clone or download this directory
- Serve the files using any static web server:
# Using Python
python -m http.server 8000
# Using Node.js
npx serve .
# Using PHP
php -S localhost:8000
- Open http://localhost:8000 in your browser
Option 2: GitHub Pages
- Push this directory to a GitHub repository
- Enable GitHub Pages in repository settings
- Access via https://[username].github.io/[repo-name]/
Option 3: Direct File Access
Simply open index.html in a modern web browser. Note: Some features may be limited due to CORS restrictions.
Usage
Loading a Dataset
- Enter a HuggingFace dataset ID (e.g., davanstrien/exams-ocr)
- Click "Load" or press Enter
- The explorer will automatically detect text columns
Navigation
- Next: Press J or the corresponding arrow key
- Previous: Press K or the corresponding arrow key
- Switch Views: Press 1 (comparison), 2 (diff), or 3 (improved only)
Supported Column Names
The explorer automatically detects these column patterns:
- Original OCR: text, ocr, original_text, ground_truth
- Improved OCR: markdown, new_ocr, corrected_text, vlm_ocr
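The detection described above can be sketched as a lookup against the two name lists. This is an assumption-laden sketch rather than the code in js/dataset-api.js: it assumes exact, case-sensitive matching and that the first matching column wins, and the constant names are hypothetical.

```javascript
// Sketch of column detection; the pattern lists come from this README.
// Matching rules (exact, case-sensitive, first match wins) are assumptions.
const ORIGINAL_PATTERNS = ["text", "ocr", "original_text", "ground_truth"];
const IMPROVED_PATTERNS = ["markdown", "new_ocr", "corrected_text", "vlm_ocr"];

function detectColumns(columnNames) {
  let originalTextColumn = null;
  let improvedTextColumn = null;

  for (const name of columnNames) {
    if (!originalTextColumn && ORIGINAL_PATTERNS.includes(name)) {
      originalTextColumn = name;
    }
    if (!improvedTextColumn && IMPROVED_PATTERNS.includes(name)) {
      improvedTextColumn = name;
    }
  }

  // Either value stays null when no known pattern is present.
  return { originalTextColumn, improvedTextColumn };
}
```

For example, `detectColumns(["image", "text", "markdown"])` would pick `text` as the original column and `markdown` as the improved one.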
Technical Details
Architecture
┌─────────────────┐     ┌──────────────────────┐
│   Browser UI    │────▶│ HF Dataset Viewer API│
│   (Alpine.js)   │     │  (datasets-server)   │
└─────────────────┘     └──────────────────────┘
         │
         ▼
┌─────────────────┐
│   Local Cache   │
│  (JavaScript)   │
└─────────────────┘
API Integration
Uses the HuggingFace Dataset Viewer API:
- Base URL: https://datasets-server.huggingface.co
- No authentication required for public datasets
- Automatic handling of image URL expiration
- Smart batching for efficient data loading
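As a rough illustration of a request against this API, the sketch below builds a URL for the Dataset Viewer `/rows` endpoint, which takes `dataset`, `config`, `split`, `offset`, and `length` query parameters. The helper names `buildRowsUrl` and `fetchRows` are hypothetical, and the `config` and `split` values in the usage note are assumptions about the example dataset.

```javascript
// Sketch of a Dataset Viewer /rows request; helper names are hypothetical.
const BASE_URL = "https://datasets-server.huggingface.co";
const MAX_ROWS_PER_REQUEST = 100; // per-request limit noted in this README

function buildRowsUrl({ dataset, config, split, offset = 0, length = MAX_ROWS_PER_REQUEST }) {
  const params = new URLSearchParams({
    dataset,
    config,
    split,
    offset: String(offset),
    // Clamp to the API limit so oversized requests are not sent.
    length: String(Math.min(length, MAX_ROWS_PER_REQUEST)),
  });
  return `${BASE_URL}/rows?${params.toString()}`;
}

async function fetchRows(options) {
  // Public datasets need no Authorization header.
  const response = await fetch(buildRowsUrl(options));
  if (!response.ok) {
    throw new Error(`Dataset Viewer API returned ${response.status}`);
  }
  return response.json();
}
```

A call might look like `fetchRows({ dataset: "davanstrien/exams-ocr", config: "default", split: "train", offset: 0 })`, assuming the dataset exposes a `default` config with a `train` split.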
Performance Optimizations
- Batch Loading: Fetches 100 rows at a time
- Smart Caching: Reduces API calls
- Lazy Loading: Only loads visible content
- URL Refresh: Automatically refreshes expired image URLs
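Batch loading and caching can be combined in a small structure like the one below. This is a sketch, not the app's code: `RowCache` and `batchOffsetFor` are hypothetical names, and the loader is synchronous here only to keep the example short (in the browser it would return a Promise from a network request).

```javascript
// Sketch of batch-aligned caching; names are hypothetical.
const BATCH_SIZE = 100; // rows fetched per request, as stated above

function batchOffsetFor(rowIndex) {
  // Align the row index to the start of its 100-row batch.
  return Math.floor(rowIndex / BATCH_SIZE) * BATCH_SIZE;
}

class RowCache {
  constructor(loadBatch) {
    this.loadBatch = loadBatch; // (offset, length) => array of rows
    this.batches = new Map(); // batch offset -> loaded rows
    this.requestCount = 0; // how many batches were actually loaded
  }

  getRow(rowIndex) {
    const offset = batchOffsetFor(rowIndex);
    if (!this.batches.has(offset)) {
      // Cache miss: load the whole batch once, then serve from memory.
      this.requestCount++;
      this.batches.set(offset, this.loadBatch(offset, BATCH_SIZE));
    }
    return this.batches.get(offset)[rowIndex - offset];
  }
}
```

With this layout, stepping through rows 0 to 99 triggers a single load, and a second load happens only when navigation crosses into the next batch.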
Customization
Adding New Column Patterns
Edit js/dataset-api.js and update the detectColumns method:
// Treat 'your_column_name' as an original-OCR column if none has been found yet
if (!originalTextColumn && ['your_column_name'].includes(name)) {
    originalTextColumn = name;
}
Styling
The UI uses Tailwind CSS. Modify styles in:
- css/styles.css for custom styles
- Tailwind classes directly in index.html
Keyboard Shortcuts
Add new shortcuts in js/app.js:
case 'your_key':
    // Your action here
    break;
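To show how such a case might sit inside a complete handler, here is a hedged sketch. The action names, the `actionForKey` function, and the extra `g` shortcut are all hypothetical; only the J, K, 1, 2, and 3 bindings come from this README, and the arrow-key cases are left out.

```javascript
// Sketch of a key-to-action mapping; not the handler in js/app.js.
function actionForKey(key) {
  switch (key.toLowerCase()) {
    case "j":
      return "next";
    case "k":
      return "previous";
    case "1":
      return "view:comparison";
    case "2":
      return "view:diff";
    case "3":
      return "view:improved";
    case "g": // hypothetical new shortcut: jump to the first sample
      return "first";
    default:
      return null; // unhandled keys fall through to the browser
  }
}

// Browser-only wiring; skipped when no DOM is available.
if (typeof document !== "undefined") {
  document.addEventListener("keydown", (event) => {
    const action = actionForKey(event.key);
    if (action) {
      console.log("shortcut action:", action);
    }
  });
}
```

Keeping the mapping in a pure function makes a new shortcut easy to check without a browser.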
Browser Support
- Chrome/Edge: Full support
- Firefox: Full support
- Safari: Full support (14+)
- Mobile browsers: Full support with touch navigation
Limitations
- Maximum 100 rows per API request
- Image URLs expire after ~1 hour
- No authentication support for private datasets (yet)
- Read-only interface (no editing capabilities)
Future Enhancements
- Export functionality for improved texts
- Batch processing capabilities
- Search within dataset
- Bookmarking system
- Authentication for private datasets
- Confidence scores visualization
- Multi-dataset comparison
Troubleshooting
"Dataset viewer is not available"
- Check if the dataset exists on HuggingFace
- Ensure the dataset has viewer enabled
- Try a known working dataset like davanstrien/exams-ocr
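One way to run these checks programmatically is the Dataset Viewer `/is-valid` endpoint, which reports boolean flags such as `viewer` and `preview` for a dataset. The sketch below is an illustration, not part of the app; `buildIsValidUrl`, `describeValidity`, and `checkDataset` are hypothetical names.

```javascript
// Sketch of a viewer-availability check; helper names are hypothetical.
const VIEWER_API = "https://datasets-server.huggingface.co";

function buildIsValidUrl(dataset) {
  return `${VIEWER_API}/is-valid?dataset=${encodeURIComponent(dataset)}`;
}

function describeValidity(validity) {
  // `viewer` and `preview` are boolean fields in the /is-valid response.
  if (validity.viewer) return "viewer available";
  if (validity.preview) return "preview only";
  return "viewer not available";
}

async function checkDataset(dataset) {
  const response = await fetch(buildIsValidUrl(dataset));
  if (!response.ok) {
    // Typically means the dataset does not exist or is not public.
    return `request failed with status ${response.status}`;
  }
  return describeValidity(await response.json());
}
```

Running `checkDataset("davanstrien/exams-ocr").then(console.log)` in the browser console gives a quick way to tell a missing dataset from one whose viewer is disabled.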
Images not loading
- Image URLs expire after ~1 hour
- The app automatically refreshes URLs on error
- Check browser console for detailed errors
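The refresh-on-error behavior could look roughly like the sketch below. It is an assumption about the approach rather than the app's code: `isLikelyExpired` and `attachRefreshOnError` are hypothetical helpers, `refetchUrl` stands in for whatever call re-requests the row to get a fresh signed URL, and the one-hour lifetime is the approximate figure given above.

```javascript
// Sketch of expiry handling for signed image URLs; names are hypothetical.
const IMAGE_URL_TTL_MS = 60 * 60 * 1000; // ~1 hour; the exact lifetime is an assumption

function isLikelyExpired(fetchedAtMs, nowMs = Date.now(), ttlMs = IMAGE_URL_TTL_MS) {
  return nowMs - fetchedAtMs >= ttlMs;
}

function attachRefreshOnError(img, refetchUrl) {
  let retried = false; // retry once to avoid an endless error loop
  img.addEventListener("error", async () => {
    if (retried) return;
    retried = true;
    try {
      // refetchUrl is expected to resolve to a fresh URL for the same image.
      img.src = await refetchUrl();
    } catch (error) {
      console.error("Could not refresh image URL:", error);
    }
  });
}
```

`isLikelyExpired` allows a proactive refresh before the image is shown, while the error listener covers URLs that expire while the page stays open.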
Slow loading
- Large datasets may take time for initial load
- Consider using datasets with pre-computed statistics
- Check your internet connection
Contributing
This is a standalone tool designed for OCR exploration. Feel free to fork and customize for your needs!
License
MIT License - Use freely for any purpose
Related Projects
- OCR Time Machine - Interactive OCR improvement with VLMs
- HuggingFace Datasets - Browse available datasets
- Dataset Viewer Docs - API documentation