---
title: OCR Time Capsule
emoji: π¦
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
---
# OCR Time Capsule
A fast, modern web interface for exploring and comparing OCR text improvements in HuggingFace datasets. Browse through pre-processed OCR improvements to see how AI models enhance historical document transcriptions.

## Features
- **Fast Navigation**: Browse through large OCR datasets with keyboard shortcuts (J/K or arrow keys)
- **Side-by-Side Comparison**: View original OCR and improved text simultaneously
- **Advanced Diff Visualization**: Character, word, or line-level differences with color highlighting
- **No Backend Required**: Direct integration with HuggingFace Dataset Viewer API
- **Responsive Design**: Works seamlessly on desktop and mobile devices
- **Dark Mode**: Easy on the eyes for extended reading sessions
- **URL Sharing**: Share specific dataset samples with direct links
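
The word-level mode of the diff view can be pictured with a small longest-common-subsequence diff. This is only a minimal sketch, not the explorer's actual implementation: `wordDiff` and the shape of its output are hypothetical names chosen for illustration.

```javascript
// Hypothetical helper: word-level diff via a longest-common-subsequence table.
function wordDiff(original, improved) {
  const a = original.split(/\s+/).filter(Boolean);
  const b = improved.split(/\s+/).filter(Boolean);

  // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
  const lcs = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0)
  );
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] = a[i] === b[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  // Walk the table, emitting equal / removed / added words in reading order.
  const ops = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      ops.push({ type: 'equal', word: a[i] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      ops.push({ type: 'removed', word: a[i++] });
    } else {
      ops.push({ type: 'added', word: b[j++] });
    }
  }
  while (i < a.length) ops.push({ type: 'removed', word: a[i++] });
  while (j < b.length) ops.push({ type: 'added', word: b[j++] });
  return ops;
}
```

A renderer could then map `removed` and `added` entries to red and green highlights, leaving `equal` words unstyled.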
## Quick Start
### Option 1: Local Development
1. Clone or download this directory
2. Serve the files using any static web server:
```bash
# Using Python
python -m http.server 8000
# Using Node.js (serve listens on port 3000 by default, so set the port explicitly)
npx serve -l 8000 .
# Using PHP
php -S localhost:8000
```
3. Open http://localhost:8000 in your browser
### Option 2: GitHub Pages
1. Push this directory to a GitHub repository
2. Enable GitHub Pages in repository settings
3. Access via `https://[username].github.io/[repo-name]/`
### Option 3: Direct File Access
Open `index.html` directly in a modern web browser. Some features may be limited by CORS restrictions on pages loaded from `file://`; if something fails to load, serve the files with Option 1 instead.
## Usage
### Loading a Dataset
1. Enter a HuggingFace dataset ID (e.g., `davanstrien/exams-ocr`)
2. Click "Load" or press Enter
3. The explorer will automatically detect text columns
### Navigation
- **Next**: Press `J` or the `→` arrow key
- **Previous**: Press `K` or the `←` arrow key
- **Switch Views**: Press `1` (comparison), `2` (diff), or `3` (improved only)
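
The shortcuts above could be dispatched from a single key-to-action lookup. This is a hedged sketch rather than the code in `js/app.js`: the action names and `keyToAction` are hypothetical, and it assumes the right arrow means "next" and the left arrow means "previous".

```javascript
// Hypothetical mapping from KeyboardEvent.key values to explorer actions.
const KEY_ACTIONS = new Map([
  ['j', 'next'],
  ['ArrowRight', 'next'],
  ['k', 'previous'],
  ['ArrowLeft', 'previous'],
  ['1', 'view:comparison'],
  ['2', 'view:diff'],
  ['3', 'view:improved'],
]);

function keyToAction(key) {
  // Letters arrive as 'j' or 'J' depending on Shift/Caps Lock; normalise them.
  const normalised = key.length === 1 ? key.toLowerCase() : key;
  return KEY_ACTIONS.get(normalised) ?? null;
}

// In the browser this might be wired up roughly as follows
// (handleAction is a hypothetical dispatcher):
// document.addEventListener('keydown', (event) => {
//   if (event.target.tagName === 'INPUT') return; // don't hijack typing
//   const action = keyToAction(event.key);
//   if (action) handleAction(action);
// });
```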
### Supported Column Names
The explorer automatically detects these column patterns:
- **Original OCR**: `text`, `ocr`, `original_text`, `ground_truth`
- **Improved OCR**: `markdown`, `new_ocr`, `corrected_text`, `vlm_ocr`
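
As an illustration of how such detection might work, the sketch below picks the first column whose name appears in each list. It assumes exact, case-sensitive matching, which may differ from the real `detectColumns` method in `js/dataset-api.js`; the constant names are hypothetical.

```javascript
// Candidate names taken from the lists above; the first match wins.
const ORIGINAL_PATTERNS = ['text', 'ocr', 'original_text', 'ground_truth'];
const IMPROVED_PATTERNS = ['markdown', 'new_ocr', 'corrected_text', 'vlm_ocr'];

// Hypothetical stand-in for the explorer's detectColumns method.
function detectColumns(columnNames) {
  let originalTextColumn = null;
  let improvedTextColumn = null;
  for (const name of columnNames) {
    if (!originalTextColumn && ORIGINAL_PATTERNS.includes(name)) {
      originalTextColumn = name;
    } else if (!improvedTextColumn && IMPROVED_PATTERNS.includes(name)) {
      improvedTextColumn = name;
    }
  }
  return { originalTextColumn, improvedTextColumn };
}
```

For a dataset with columns `image`, `text`, and `markdown`, this sketch would select `text` as the original and `markdown` as the improved column.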
## Technical Details
### Architecture
```
┌─────────────────┐     ┌──────────────────────┐
│   Browser UI    │────▶│ HF Dataset Viewer API│
│   (Alpine.js)   │     │  (datasets-server)   │
└─────────────────┘     └──────────────────────┘
         │
         ▼
┌─────────────────┐
│   Local Cache   │
│  (JavaScript)   │
└─────────────────┘
```
### API Integration
Uses the HuggingFace Dataset Viewer API:
- Base URL: `https://datasets-server.huggingface.co`
- No authentication required for public datasets
- Automatic handling of image URL expiration
- Smart batching for efficient data loading
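
A request to the API's `/rows` endpoint might look like the sketch below. The function names are hypothetical, and the `config` and `split` defaults are placeholders: real values depend on the dataset and can be listed through the `/splits` endpoint.

```javascript
const API_BASE = 'https://datasets-server.huggingface.co';

// Build a /rows request URL. The config and split defaults are assumptions;
// check the dataset's actual configs and splits before relying on them.
function buildRowsUrl({ dataset, config = 'default', split = 'train', offset = 0, length = 100 }) {
  const url = new URL('/rows', API_BASE);
  url.search = new URLSearchParams({ dataset, config, split, offset, length }).toString();
  return url.toString();
}

// Fetch one batch of rows; throws on HTTP errors.
async function fetchRows(params) {
  const response = await fetch(buildRowsUrl(params));
  if (!response.ok) {
    throw new Error(`Dataset Viewer API returned ${response.status}`);
  }
  return response.json(); // expected to include a `rows` array
}
```

Calling `fetchRows({ dataset: 'davanstrien/exams-ocr' })` would request the first 100 rows, provided the assumed config and split exist for that dataset.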
### Performance Optimizations
- **Batch Loading**: Fetches 100 rows at a time
- **Smart Caching**: Reduces API calls
- **Lazy Loading**: Only loads visible content
- **URL Refresh**: Automatically refreshes expired image URLs
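
Batch loading and caching can be combined by keying cached batches on their starting offset. The following is a minimal sketch under that assumption, not the explorer's actual cache; `RowCache` and `fetchBatch` are hypothetical names.

```javascript
const BATCH_SIZE = 100; // matches the 100-row limit per API request

// Hypothetical cache: maps a batch's starting offset to a promise of its rows.
class RowCache {
  constructor(fetchBatch) {
    this.fetchBatch = fetchBatch; // (offset, length) => Promise<row[]>
    this.batches = new Map();
  }

  async getRow(index) {
    const offset = Math.floor(index / BATCH_SIZE) * BATCH_SIZE;
    if (!this.batches.has(offset)) {
      // Store the promise so concurrent callers share a single request,
      // and forget it again if the request fails so it can be retried.
      const pending = this.fetchBatch(offset, BATCH_SIZE).catch((error) => {
        this.batches.delete(offset);
        throw error;
      });
      this.batches.set(offset, pending);
    }
    const rows = await this.batches.get(offset);
    return rows[index - offset];
  }
}
```

With this layout, stepping through rows 0 to 99 triggers one request, and moving to row 100 triggers the next.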
## Customization
### Adding New Column Patterns
Edit `js/dataset-api.js` and update the `detectColumns` method:
```javascript
// Treat 'your_column_name' as an original-OCR column if none was found yet
if (!originalTextColumn && ['your_column_name'].includes(name)) {
  originalTextColumn = name;
}
```
### Styling
The UI uses Tailwind CSS. Modify styles in:
- `css/styles.css` for custom styles
- Tailwind classes directly in `index.html`
### Keyboard Shortcuts
Add new shortcuts in `js/app.js`:
```javascript
case 'your_key':
  // Your action here
  break;
```
## Browser Support
- Chrome/Edge: Full support
- Firefox: Full support
- Safari: Full support (14+)
- Mobile browsers: Full support with touch navigation
## Limitations
- Maximum 100 rows per API request
- Image URLs expire after ~1 hour
- No authentication support for private datasets (yet)
- Read-only interface (no editing capabilities)
## Future Enhancements
- [ ] Export functionality for improved texts
- [ ] Batch processing capabilities
- [ ] Search within dataset
- [ ] Bookmarking system
- [ ] Authentication for private datasets
- [ ] Confidence scores visualization
- [ ] Multi-dataset comparison
## Troubleshooting
### "Dataset viewer is not available"
- Check if the dataset exists on HuggingFace
- Ensure the dataset has the viewer enabled
- Try a known working dataset like `davanstrien/exams-ocr`
### Images not loading
- Image URLs expire after ~1 hour
- The app automatically refreshes URLs on error
- Check browser console for detailed errors
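
The refresh-on-error behaviour can be sketched as a single retry with a freshly fetched URL. This is an illustration only: `loadImageWithRefresh`, `loadImage`, and `getFreshUrl` are hypothetical names, and the real app may refresh URLs differently.

```javascript
// Hypothetical helper: try an image URL once and, if loading fails, ask for a
// fresh URL (for example by re-fetching the row from the API) and retry once.
async function loadImageWithRefresh(url, loadImage, getFreshUrl) {
  try {
    return await loadImage(url);
  } catch (firstError) {
    const freshUrl = await getFreshUrl();
    return loadImage(freshUrl); // a second failure propagates to the caller
  }
}
```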
### Slow loading
- Large datasets may take time for initial load
- Consider using datasets with pre-computed statistics
- Check your internet connection
## Contributing
This is a standalone tool designed for OCR exploration. Feel free to fork and customize for your needs!
## License
MIT License - Use freely for any purpose
## Related Projects
- [OCR Time Machine](../app.py) - Interactive OCR improvement with VLMs
- [HuggingFace Datasets](https://huggingface.co/datasets) - Browse available datasets
- [Dataset Viewer Docs](https://huggingface.co/docs/dataset-viewer) - API documentation