---
title: OCR Time Capsule
emoji: 📦
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
---

# OCR Time Capsule 📦

A fast, modern web interface for exploring and comparing OCR text improvements in HuggingFace datasets. Browse through pre-processed OCR improvements to see how AI models enhance historical document transcriptions.

![OCR Time Capsule](https://img.shields.io/badge/OCR-Time%20Capsule-blue)

## Features

- **Fast Navigation**: Browse through large OCR datasets with keyboard shortcuts (J/K or arrow keys)
- **Side-by-Side Comparison**: View original OCR and improved text simultaneously
- **Advanced Diff Visualization**: Character, word, or line-level differences with color highlighting
- **No Backend Required**: Direct integration with HuggingFace Dataset Viewer API
- **Responsive Design**: Works seamlessly on desktop and mobile devices
- **Dark Mode**: Easy on the eyes for extended reading sessions
- **URL Sharing**: Share specific dataset samples with direct links

## Quick Start

### Option 1: Local Development

1. Clone or download this directory
2. Serve the files using any static web server:

   ```bash
   # Using Python
   python -m http.server 8000

   # Using Node.js
   npx serve .

   # Using PHP
   php -S localhost:8000
   ```

3. Open http://localhost:8000 in your browser

### Option 2: GitHub Pages

1. Push this directory to a GitHub repository
2. Enable GitHub Pages in repository settings
3. Access via `https://[username].github.io/[repo-name]/`

### Option 3: Direct File Access

Simply open `index.html` in a modern web browser.

Note: Some features may be limited due to CORS restrictions.

## Usage

### Loading a Dataset

1. Enter a HuggingFace dataset ID (e.g., `davanstrien/exams-ocr`)
2. Click "Load" or press Enter
3. The explorer will automatically detect text columns

### Navigation

- **Next**: Press `J` or `→` arrow key
- **Previous**: Press `K` or `←` arrow key
- **Switch Views**: Press `1` (comparison), `2` (diff), or `3` (improved only)

### Supported Column Names

The explorer automatically detects these column patterns:

**Original OCR**: `text`, `ocr`, `original_text`, `ground_truth`

**Improved OCR**: `markdown`, `new_ocr`, `corrected_text`, `vlm_ocr`

## Technical Details

### Architecture

```
┌─────────────────┐     ┌──────────────────────┐
│   Browser UI    │────▶│ HF Dataset Viewer API│
│  (Alpine.js)    │     │  (datasets-server)   │
└─────────────────┘     └──────────────────────┘
         │
         ▼
┌─────────────────┐
│  Local Cache    │
│  (JavaScript)   │
└─────────────────┘
```

### API Integration

Uses the HuggingFace Dataset Viewer API:

- Base URL: `https://datasets-server.huggingface.co`
- No authentication required for public datasets
- Automatic handling of image URL expiration
- Smart batching for efficient data loading

### Performance Optimizations

- **Batch Loading**: Fetches 100 rows at a time
- **Smart Caching**: Reduces API calls
- **Lazy Loading**: Only loads visible content
- **URL Refresh**: Automatically refreshes expired image URLs

## Customization

### Adding New Column Patterns

Edit `js/dataset-api.js` and update the `detectColumns` method:

```javascript
if (!originalTextColumn && ['your_column_name'].includes(name)) {
    originalTextColumn = name;
}
```

### Styling

The UI uses Tailwind CSS. Modify styles in:

- `css/styles.css` for custom styles
- Tailwind classes directly in `index.html`

### Keyboard Shortcuts

Add new shortcuts in `js/app.js`:

```javascript
case 'your_key':
    // Your action here
    break;
```

## Browser Support

- Chrome/Edge: Full support
- Firefox: Full support
- Safari: Full support (14+)
- Mobile browsers: Full support with touch navigation

## Limitations

- Maximum 100 rows per API request
- Image URLs expire after ~1 hour
- No authentication support for private datasets (yet)
- Read-only interface (no editing capabilities)

## Future Enhancements

- [ ] Export functionality for improved texts
- [ ] Batch processing capabilities
- [ ] Search within dataset
- [ ] Bookmarking system
- [ ] Authentication for private datasets
- [ ] Confidence scores visualization
- [ ] Multi-dataset comparison

## Troubleshooting

### "Dataset viewer is not available"

- Check if the dataset exists on HuggingFace
- Ensure the dataset has viewer enabled
- Try a known working dataset like `davanstrien/exams-ocr`

### Images not loading

- Image URLs expire after ~1 hour
- The app automatically refreshes URLs on error
- Check browser console for detailed errors

### Slow loading

- Large datasets may take time for initial load
- Consider using datasets with pre-computed statistics
- Check your internet connection

## Contributing

This is a standalone tool designed for OCR exploration. Feel free to fork and customize for your needs!

## License

MIT License - Use freely for any purpose

## Related Projects

- [OCR Time Machine](../app.py) - Interactive OCR improvement with VLMs
- [HuggingFace Datasets](https://huggingface.co/datasets) - Browse available datasets
- [Dataset Viewer Docs](https://huggingface.co/docs/dataset-viewer) - API documentation
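
## Appendix: Column Detection and Batch Loading Sketch

To make the column detection and 100-row batch loading described under Usage and Technical Details more concrete, here is a minimal sketch of how they might fit together. It is not the code in `js/dataset-api.js`: the names `detectColumns` (as written here), `batchOffset`, `buildRowsUrl`, `fetchBatch`, and `batchCache` are hypothetical, and the `default` config and `train` split in the usage line are assumptions about the dataset. The only details taken from this README are the base URL, the batch size, and the column name lists; the `/rows` endpoint and its `dataset`, `config`, `split`, `offset`, and `length` parameters come from the Dataset Viewer API.

```javascript
// Hypothetical sketch; none of these names are taken from js/dataset-api.js.
const API_BASE = 'https://datasets-server.huggingface.co';
const BATCH_SIZE = 100; // maximum rows per API request, per the Limitations section

// Column names listed under "Supported Column Names".
const ORIGINAL_PATTERNS = ['text', 'ocr', 'original_text', 'ground_truth'];
const IMPROVED_PATTERNS = ['markdown', 'new_ocr', 'corrected_text', 'vlm_ocr'];

// Return the first column matching each pattern list, or null if none match.
function detectColumns(columnNames) {
    const find = (patterns) =>
        columnNames.find((name) => patterns.includes(name.toLowerCase())) ?? null;
    return {
        original: find(ORIGINAL_PATTERNS),
        improved: find(IMPROVED_PATTERNS),
    };
}

// Offset of the 100-row batch that contains the given row index.
function batchOffset(rowIndex) {
    return Math.floor(rowIndex / BATCH_SIZE) * BATCH_SIZE;
}

// Build the /rows request URL for the batch containing rowIndex.
function buildRowsUrl(dataset, config, split, rowIndex) {
    const params = new URLSearchParams({
        dataset,
        config,
        split,
        offset: String(batchOffset(rowIndex)),
        length: String(BATCH_SIZE),
    });
    return `${API_BASE}/rows?${params}`;
}

// Fetch one batch, reusing a cached response when the same batch was already loaded.
const batchCache = new Map();
async function fetchBatch(dataset, config, split, rowIndex) {
    const url = buildRowsUrl(dataset, config, split, rowIndex);
    if (!batchCache.has(url)) {
        const response = await fetch(url);
        if (!response.ok) {
            throw new Error(`Request failed with status ${response.status}`);
        }
        batchCache.set(url, await response.json());
    }
    return batchCache.get(url);
}

// Example (assumed config and split names):
// fetchBatch('davanstrien/exams-ocr', 'default', 'train', 250).then(console.log);
```

Keying the cache by request URL means that moving between rows inside the same batch of 100 triggers no further network calls, which is one simple way to get the "Smart Caching" behavior listed above. A real implementation would also need to handle the image URL expiry noted under Limitations, which this sketch leaves out.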