---
title: OCR Time Capsule
emoji: 📦
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
---
# OCR Time Capsule 📦
A fast, modern web interface for exploring and comparing OCR text improvements in HuggingFace datasets. Browse through pre-processed OCR improvements to see how AI models enhance historical document transcriptions.
![OCR Time Capsule](https://img.shields.io/badge/OCR-Time%20Capsule-blue)
## Features
- **Fast Navigation**: Browse through large OCR datasets with keyboard shortcuts (J/K or arrow keys)
- **Side-by-Side Comparison**: View original OCR and improved text simultaneously
- **Advanced Diff Visualization**: Character, word, or line-level differences with color highlighting
- **No Backend Required**: Direct integration with HuggingFace Dataset Viewer API
- **Responsive Design**: Works seamlessly on desktop and mobile devices
- **Dark Mode**: Easy on the eyes for extended reading sessions
- **URL Sharing**: Share specific dataset samples with direct links
## Quick Start
### Option 1: Local Development
1. Clone or download this directory
2. Serve the files using any static web server:
```bash
# Using Python 3
python3 -m http.server 8000
# Using Node.js (serve defaults to port 3000, so set the port explicitly)
npx serve -l 8000 .
# Using PHP
php -S localhost:8000
```
3. Open http://localhost:8000 in your browser
### Option 2: GitHub Pages
1. Push this directory to a GitHub repository
2. Enable GitHub Pages in repository settings
3. Access via `https://[username].github.io/[repo-name]/`
### Option 3: Direct File Access
Open `index.html` directly in a modern web browser. Note: some features may be limited because browsers apply stricter CORS rules to pages loaded from `file://` URLs.
## Usage
### Loading a Dataset
1. Enter a HuggingFace dataset ID (e.g., `davanstrien/exams-ocr`)
2. Click "Load" or press Enter
3. The explorer will automatically detect text columns
### Navigation
- **Next**: Press `J` or `→` arrow key
- **Previous**: Press `K` or `←` arrow key
- **Switch Views**: Press `1` (comparison), `2` (diff), or `3` (improved only)
### Supported Column Names
The explorer automatically detects these column patterns:
- **Original OCR**: `text`, `ocr`, `original_text`, `ground_truth`
- **Improved OCR**: `markdown`, `new_ocr`, `corrected_text`, `vlm_ocr`
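The detection described above can be sketched roughly as follows. This is an illustration only: `detectTextColumns` and its return shape are hypothetical, and only the candidate names come from the lists above. It assumes exact, case-sensitive name matching and that the first matching column wins.

```javascript
// Candidate names copied from the lists above.
const ORIGINAL_NAMES = ['text', 'ocr', 'original_text', 'ground_truth'];
const IMPROVED_NAMES = ['markdown', 'new_ocr', 'corrected_text', 'vlm_ocr'];

// Hypothetical helper: `columnNames` is an array of column names in dataset order.
function detectTextColumns(columnNames) {
  let originalTextColumn = null;
  let improvedTextColumn = null;
  for (const name of columnNames) {
    // Keep the first match for each role and ignore later candidates.
    if (!originalTextColumn && ORIGINAL_NAMES.includes(name)) {
      originalTextColumn = name;
    } else if (!improvedTextColumn && IMPROVED_NAMES.includes(name)) {
      improvedTextColumn = name;
    }
  }
  return { originalTextColumn, improvedTextColumn };
}

console.log(detectTextColumns(['image', 'text', 'markdown']));
```

Columns that match neither list (such as `image` here) are ignored, and a role with no match stays `null`.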
## Technical Details
### Architecture
```
┌─────────────────┐     ┌──────────────────────┐
│   Browser UI    │────▶│ HF Dataset Viewer API│
│   (Alpine.js)   │     │  (datasets-server)   │
└─────────────────┘     └──────────────────────┘
         │
         ▼
┌─────────────────┐
│   Local Cache   │
│  (JavaScript)   │
└─────────────────┘
```
### API Integration
Uses the HuggingFace Dataset Viewer API:
- Base URL: `https://datasets-server.huggingface.co`
- No authentication required for public datasets
- Automatic handling of image URL expiration
- Smart batching for efficient data loading
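As a rough sketch of what a request to the Dataset Viewer API looks like, the snippet below builds a `/rows` URL and unwraps the response. The helper names `buildRowsUrl` and `fetchRows` are hypothetical, and the `default` config and `train` split in the usage line are assumptions about the example dataset, not values taken from this project.

```javascript
const BASE_URL = 'https://datasets-server.huggingface.co';

// Build a /rows URL; URLSearchParams percent-encodes the "/" in the dataset ID.
function buildRowsUrl(dataset, config, split, offset = 0, length = 100) {
  const params = new URLSearchParams({
    dataset,
    config,
    split,
    offset: String(offset),
    length: String(length),
  });
  return `${BASE_URL}/rows?${params.toString()}`;
}

// Fetch one page of rows (requires a runtime with a global fetch,
// such as a browser or Node 18+).
async function fetchRows(dataset, config, split, offset = 0, length = 100) {
  const response = await fetch(buildRowsUrl(dataset, config, split, offset, length));
  if (!response.ok) {
    throw new Error(`Dataset Viewer request failed: HTTP ${response.status}`);
  }
  const body = await response.json();
  // Each entry wraps the actual column values in a `row` object.
  return body.rows.map((entry) => entry.row);
}

// Assumed config and split names, for illustration only.
console.log(buildRowsUrl('davanstrien/exams-ocr', 'default', 'train'));
```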
### Performance Optimizations
- **Batch Loading**: Fetches 100 rows at a time
- **Smart Caching**: Keeps fetched rows in a local JavaScript cache to reduce API calls
- **Lazy Loading**: Only loads visible content
- **URL Refresh**: Automatically refreshes expired image URLs
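One way batch loading and caching could fit together is sketched below. This is not the project's implementation: `RowCache` and the injected `fetchBatch(offset, length)` function are hypothetical, and the sketch assumes batches are aligned to multiples of 100 rows.

```javascript
const BATCH_SIZE = 100;

// Hypothetical cache keyed by batch offset. `fetchBatch(offset, length)` is an
// injected async function that resolves to an array of rows.
class RowCache {
  constructor(fetchBatch) {
    this.fetchBatch = fetchBatch;
    this.batches = new Map(); // batch offset -> Promise resolving to rows
  }

  async getRow(index) {
    // Align the request to the start of the batch containing `index`.
    const offset = Math.floor(index / BATCH_SIZE) * BATCH_SIZE;
    if (!this.batches.has(offset)) {
      // Cache the promise so concurrent callers share a single request;
      // drop it again on failure so a later call can retry.
      const pending = this.fetchBatch(offset, BATCH_SIZE).catch((error) => {
        this.batches.delete(offset);
        throw error;
      });
      this.batches.set(offset, pending);
    }
    const rows = await this.batches.get(offset);
    return rows[index - offset];
  }
}
```

With this layout, stepping through rows 0 to 99 triggers a single request, and the next request happens only when navigation crosses into row 100.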
## Customization
### Adding New Column Patterns
Edit `js/dataset-api.js` and update the `detectColumns` method:
```javascript
if (!originalTextColumn && ['your_column_name'].includes(name)) {
  originalTextColumn = name;
}
```
### Styling
The UI uses Tailwind CSS. Modify styles in:
- `css/styles.css` for custom styles
- Tailwind classes directly in `index.html`
### Keyboard Shortcuts
Add new shortcuts in `js/app.js`:
```javascript
case 'your_key':
  // Your action here
  break;
```
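For context, the `switch` that such a `case` plugs into might look something like the sketch below. The function `actionForKey` and the action names are hypothetical; only the key bindings come from the Navigation section of this README. Keys are compared against `KeyboardEvent.key` values, lowercased so that `J` and `j` behave the same.

```javascript
// Hypothetical mapping from KeyboardEvent.key values to action names.
function actionForKey(key) {
  switch (key.toLowerCase()) {
    case 'j':
    case 'arrowright':
      return 'next';
    case 'k':
    case 'arrowleft':
      return 'previous';
    case '1':
      return 'view:comparison';
    case '2':
      return 'view:diff';
    case '3':
      return 'view:improved';
    default:
      return null; // Unbound key: do nothing.
  }
}

// Browser-only wiring; skipped when run outside a page.
if (typeof document !== 'undefined') {
  document.addEventListener('keydown', (event) => {
    // Ignore keystrokes typed into text fields such as the dataset ID input.
    if (event.target instanceof HTMLInputElement) return;
    const action = actionForKey(event.key);
    if (action) console.log('shortcut action:', action);
  });
}
```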
## Browser Support
- Chrome/Edge: Full support
- Firefox: Full support
- Safari: Full support (14+)
- Mobile browsers: Full support with touch navigation
## Limitations
- Maximum 100 rows per API request
- Image URLs expire after ~1 hour
- No authentication support for private datasets (yet)
- Read-only interface (no editing capabilities)
## Future Enhancements
- [ ] Export functionality for improved texts
- [ ] Batch processing capabilities
- [ ] Search within dataset
- [ ] Bookmarking system
- [ ] Authentication for private datasets
- [ ] Confidence scores visualization
- [ ] Multi-dataset comparison
## Troubleshooting
### "Dataset viewer is not available"
- Check if the dataset exists on HuggingFace
- Ensure the dataset has viewer enabled
- Try a known working dataset like `davanstrien/exams-ocr`
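To check viewer availability outside the app, a sketch like the following could query the Dataset Viewer API's `/is-valid` endpoint. It assumes the response carries boolean `viewer` and `preview` fields as described in the Dataset Viewer docs; the helper names are hypothetical and the status strings are made up for illustration.

```javascript
// Build the /is-valid URL for a dataset ID.
function buildIsValidUrl(dataset) {
  const params = new URLSearchParams({ dataset });
  return `https://datasets-server.huggingface.co/is-valid?${params.toString()}`;
}

// Turn the (assumed) response fields into a short human-readable status.
function describeViewerStatus(status) {
  if (status.viewer) return 'viewer available';
  if (status.preview) return 'preview only';
  return 'viewer not available';
}

// Requires a runtime with a global fetch (a browser or Node 18+).
async function checkDataset(dataset) {
  const response = await fetch(buildIsValidUrl(dataset));
  if (!response.ok) {
    return `request failed with HTTP ${response.status}`;
  }
  return describeViewerStatus(await response.json());
}
```

Calling `checkDataset('davanstrien/exams-ocr')` from the browser console gives a quick way to tell a dataset-side problem apart from an app-side one.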
### Images not loading
- Image URLs expire after ~1 hour
- The app automatically refreshes URLs on error
- Check browser console for detailed errors
### Slow loading
- Large datasets may take time for initial load
- Consider using datasets with pre-computed statistics
- Check your internet connection
## Contributing
This is a standalone tool designed for OCR exploration. Feel free to fork and customize for your needs!
## License
MIT License - Use freely for any purpose
## Related Projects
- [OCR Time Machine](../app.py) - Interactive OCR improvement with VLMs
- [HuggingFace Datasets](https://huggingface.co/datasets) - Browse available datasets
- [Dataset Viewer Docs](https://huggingface.co/docs/dataset-viewer) - API documentation