title: OCR Time Capsule
emoji: π¦
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
# OCR Time Capsule
A fast, modern web interface for exploring and comparing OCR text improvements in HuggingFace datasets. Browse through pre-processed OCR improvements to see how AI models enhance historical document transcriptions.
## Features
- **Fast Navigation**: Browse through large OCR datasets with keyboard shortcuts (J/K or arrow keys)
- **Side-by-Side Comparison**: View original OCR and improved text simultaneously
- **Advanced Diff Visualization**: Character, word, or line-level differences with color highlighting
- **No Backend Required**: Direct integration with HuggingFace Dataset Viewer API
- **Responsive Design**: Works seamlessly on desktop and mobile devices
- **Dark Mode**: Easy on the eyes for extended reading sessions
- **URL Sharing**: Share specific dataset samples with direct links
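
To make the word-level diff mode concrete, here is a minimal sketch based on a longest-common-subsequence table. It is an illustration only and not necessarily how this app computes its diffs; the function name `diffWords` and the `{ type, value }` output shape are assumptions made for the example.

```javascript
// Word-level diff via a longest-common-subsequence (LCS) table.
// Returns parts of the form { type: 'equal' | 'delete' | 'insert', value }.
// NOTE: hypothetical helper for illustration, not taken from the app's source.
function diffWords(original, improved) {
  const a = original.split(/\s+/).filter(Boolean);
  const b = improved.split(/\s+/).filter(Boolean);

  // lcs[i][j] = length of the LCS of a[i..] and b[j..]
  const lcs = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0)
  );
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] = a[i] === b[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  // Walk both word lists, emitting equal, deleted, and inserted words.
  const parts = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      parts.push({ type: 'equal', value: a[i] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      parts.push({ type: 'delete', value: a[i] });
      i++;
    } else {
      parts.push({ type: 'insert', value: b[j] });
      j++;
    }
  }
  while (i < a.length) parts.push({ type: 'delete', value: a[i++] });
  while (j < b.length) parts.push({ type: 'insert', value: b[j++] });
  return parts;
}
```

A renderer could then wrap `delete` parts in a red highlight and `insert` parts in a green one. Splitting on characters or lines instead of whitespace gives the other two granularities.
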
## Quick Start
### Option 1: Local Development
1. Clone or download this directory
2. Serve the files using any static web server:
```bash
# Using Python
python -m http.server 8000
# Using Node.js (-l sets the port; serve defaults to 3000 otherwise)
npx serve -l 8000 .
# Using PHP
php -S localhost:8000
```
3. Open http://localhost:8000 in your browser
### Option 2: GitHub Pages
1. Push this directory to a GitHub repository
2. Enable GitHub Pages in repository settings
3. Access via `https://[username].github.io/[repo-name]/`
### Option 3: Direct File Access
Simply open `index.html` in a modern web browser. Note: Some features may be limited due to CORS restrictions.
## Usage
### Loading a Dataset
1. Enter a HuggingFace dataset ID (e.g., `davanstrien/exams-ocr`)
2. Click "Load" or press Enter
3. The explorer will automatically detect text columns
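
Under the hood, loading a dataset comes down to a request against the Dataset Viewer API's `/rows` endpoint. The sketch below shows roughly what such a request could look like. The helper names `buildRowsUrl` and `fetchRows` are hypothetical, and the `default` config and `train` split are assumed defaults that will not match every dataset.

```javascript
const API_BASE = 'https://datasets-server.huggingface.co';

// Build a /rows request URL for one page of a dataset.
// ASSUMPTION: `config` and `split` default to 'default' and 'train' here;
// real datasets may use other names.
function buildRowsUrl(datasetId, { config = 'default', split = 'train', offset = 0, length = 100 } = {}) {
  const params = new URLSearchParams({
    dataset: datasetId,
    config,
    split,
    offset: String(offset),
    length: String(length),
  });
  return `${API_BASE}/rows?${params}`;
}

// Fetch one page of rows. Needs a runtime with a global fetch
// (modern browsers, Node.js 18+).
async function fetchRows(datasetId, options) {
  const response = await fetch(buildRowsUrl(datasetId, options));
  if (!response.ok) {
    throw new Error(`Dataset Viewer API returned HTTP ${response.status}`);
  }
  return response.json(); // includes `features` and `rows`
}
```

For example, `fetchRows('davanstrien/exams-ocr', { offset: 100 })` would request rows 100 to 199, assuming that dataset exposes a `default` config with a `train` split.
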
### Navigation
- **Next**: Press `J` or the arrow key for the next sample
- **Previous**: Press `K` or the arrow key for the previous sample
- **Switch Views**: Press `1` (comparison), `2` (diff), or `3` (improved only)
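
The shortcuts above can be pictured as a small lookup table from key to action. This is a sketch rather than the app's actual handler: the action names and `actionForKey` are made up for the example, and arrow-key bindings are left out, so add the `KeyboardEvent.key` values (for example `ArrowRight`) that fit your layout.

```javascript
// Map a pressed key to an action name.
// ASSUMPTION: the action names are hypothetical labels for this sketch.
const KEY_ACTIONS = new Map([
  ['j', 'next'],
  ['k', 'previous'],
  ['1', 'view:comparison'],
  ['2', 'view:diff'],
  ['3', 'view:improved'],
]);

// Returns the action bound to `key`, or null when the key is unbound.
function actionForKey(key) {
  return KEY_ACTIONS.get(key.toLowerCase()) ?? null;
}

// In a browser this could be wired up roughly as follows, where
// handle() is a hypothetical dispatcher:
// document.addEventListener('keydown', (event) => {
//   const action = actionForKey(event.key);
//   if (action) handle(action);
// });
```
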
### Supported Column Names
The explorer automatically detects these column patterns:
- **Original OCR**: `text`, `ocr`, `original_text`, `ground_truth`
- **Improved OCR**: `markdown`, `new_ocr`, `corrected_text`, `vlm_ocr`
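
As an illustration of how this detection could work, the sketch below picks the first column name that appears in each list. It is a simplified stand-in for the `detectColumns` method in `js/dataset-api.js`, not a copy of it: the constant names, the `improvedTextColumn` variable, and the return shape are assumptions.

```javascript
// Column names recognised for each role, as listed above.
const ORIGINAL_NAMES = ['text', 'ocr', 'original_text', 'ground_truth'];
const IMPROVED_NAMES = ['markdown', 'new_ocr', 'corrected_text', 'vlm_ocr'];

// Pick the first matching column for each role.
// `columnNames` is an array of the dataset's column names.
// ASSUMPTION: the return shape is hypothetical.
function detectColumns(columnNames) {
  let originalTextColumn = null;
  let improvedTextColumn = null;
  for (const name of columnNames) {
    if (!originalTextColumn && ORIGINAL_NAMES.includes(name)) {
      originalTextColumn = name;
    }
    if (!improvedTextColumn && IMPROVED_NAMES.includes(name)) {
      improvedTextColumn = name;
    }
  }
  return { originalTextColumn, improvedTextColumn };
}
```

With column names `['image', 'text', 'markdown']`, this sketch selects `text` as the original column and `markdown` as the improved one. A column that matches neither list leaves the corresponding field `null`.
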
## Technical Details
### Architecture
```
┌─────────────────┐     ┌──────────────────────┐
│   Browser UI    │────▶│ HF Dataset Viewer API│
│   (Alpine.js)   │     │  (datasets-server)   │
└─────────────────┘     └──────────────────────┘
         │
         ▼
┌─────────────────┐
│   Local Cache   │
│  (JavaScript)   │
└─────────────────┘
```
### API Integration
Uses the HuggingFace Dataset Viewer API:
- Base URL: `https://datasets-server.huggingface.co`
- No authentication required for public datasets
- Automatic handling of image URL expiration
- Smart batching for efficient data loading
### Performance Optimizations
- **Batch Loading**: Fetches 100 rows at a time
- **Smart Caching**: Reduces API calls
- **Lazy Loading**: Only loads visible content
- **URL Refresh**: Automatically refreshes expired image URLs
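
The batching and caching ideas above can be sketched as a small cache keyed by batch offset. This is one possible design under stated assumptions, not the app's actual cache: the `RowCache` class and its `fetchBatch(offset, length)` callback are hypothetical names.

```javascript
const BATCH_SIZE = 100; // matches the 100-row batches described above

// Caches rows in batches so that neighbouring samples share one request.
// ASSUMPTION: `fetchBatch(offset, length)` is a caller-supplied async
// function that resolves to an array of rows starting at `offset`.
class RowCache {
  constructor(fetchBatch) {
    this.fetchBatch = fetchBatch;
    this.batches = new Map(); // batch offset -> Promise of rows
  }

  async getRow(index) {
    const offset = Math.floor(index / BATCH_SIZE) * BATCH_SIZE;
    if (!this.batches.has(offset)) {
      // Store the promise itself so concurrent lookups share one fetch.
      const pending = this.fetchBatch(offset, BATCH_SIZE);
      // Forget failed batches so a later call can retry them.
      pending.catch(() => this.batches.delete(offset));
      this.batches.set(offset, pending);
    }
    const rows = await this.batches.get(offset);
    return rows[index - offset];
  }
}
```

Because image URLs expire, a cache like this would also need a way to drop a batch and fetch it again when an image fails to load; calling `this.batches.delete(offset)` from an image error handler is one simple option.
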
## Customization
### Adding New Column Patterns
Edit `js/dataset-api.js` and update the `detectColumns` method:
```javascript
if (!originalTextColumn && ['your_column_name'].includes(name)) {
  originalTextColumn = name;
}
```
### Styling
The UI uses Tailwind CSS. Modify styles in:
- `css/styles.css` for custom styles
- Tailwind classes directly in `index.html`
### Keyboard Shortcuts
Add new shortcuts in `js/app.js`:
```javascript
case 'your_key':
  // Your action here
  break;
```
## Browser Support
- Chrome/Edge: Full support
- Firefox: Full support
- Safari: Full support (14+)
- Mobile browsers: Full support with touch navigation
## Limitations
- Maximum 100 rows per API request
- Image URLs expire after ~1 hour
- No authentication support for private datasets (yet)
- Read-only interface (no editing capabilities)
## Future Enhancements
- [ ] Export functionality for improved texts
- [ ] Batch processing capabilities
- [ ] Search within dataset
- [ ] Bookmarking system
- [ ] Authentication for private datasets
- [ ] Confidence scores visualization
- [ ] Multi-dataset comparison
## Troubleshooting
### "Dataset viewer is not available"
- Check if the dataset exists on HuggingFace
- Ensure the dataset has viewer enabled
- Try a known working dataset like `davanstrien/exams-ocr`
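
One way to run the checks above from the browser console is to query the Dataset Viewer API's `/is-valid` endpoint. The sketch below is a rough aid, not part of the app: `describeViewerStatus` and `checkDataset` are hypothetical helpers, and the `viewer` and `preview` response fields are used as assumed from the Dataset Viewer docs, so verify them there.

```javascript
// Summarise the JSON body of an /is-valid response.
// ASSUMPTION: the body carries boolean `viewer` and `preview` fields.
function describeViewerStatus(isValidBody) {
  if (isValidBody.viewer) return 'viewer available';
  if (isValidBody.preview) return 'preview only';
  return 'viewer not available';
}

// Query /is-valid for a dataset ID. Needs a runtime with a global fetch
// (modern browsers, Node.js 18+).
async function checkDataset(datasetId) {
  const url = 'https://datasets-server.huggingface.co/is-valid?dataset='
    + encodeURIComponent(datasetId);
  const response = await fetch(url);
  if (!response.ok) return `request failed with HTTP ${response.status}`;
  return describeViewerStatus(await response.json());
}
```

Running `checkDataset('davanstrien/exams-ocr').then(console.log)` and comparing the result with your own dataset ID can help tell a missing dataset apart from one without a working viewer.
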
### Images not loading
- Image URLs expire after ~1 hour
- The app automatically refreshes URLs on error
- Check browser console for detailed errors
### Slow loading
- Large datasets may take time for initial load
- Consider using datasets with pre-computed statistics
- Check your internet connection
## Contributing
This is a standalone tool designed for OCR exploration. Feel free to fork and customize for your needs!
## License
MIT License - Use freely for any purpose
## Related Projects
- [OCR Time Machine](../app.py) - Interactive OCR improvement with VLMs
- [HuggingFace Datasets](https://huggingface.co/datasets) - Browse available datasets
- [Dataset Viewer Docs](https://huggingface.co/docs/dataset-viewer) - API documentation |