---
title: OCR Time Capsule
emoji: 📦
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
---

# OCR Time Capsule 📦

A fast, modern web interface for exploring and comparing OCR text improvements in HuggingFace datasets. Browse through pre-processed OCR improvements to see how AI models enhance historical document transcriptions.

![OCR Time Capsule](https://img.shields.io/badge/OCR-Time%20Capsule-blue)

## Features

- **Fast Navigation**: Browse through large OCR datasets with keyboard shortcuts (J/K or arrow keys)
- **Side-by-Side Comparison**: View original OCR and improved text simultaneously
- **Advanced Diff Visualization**: Character, word, or line-level differences with color highlighting
- **No Backend Required**: Direct integration with HuggingFace Dataset Viewer API
- **Responsive Design**: Works seamlessly on desktop and mobile devices
- **Dark Mode**: Easy on the eyes for extended reading sessions
- **URL Sharing**: Share specific dataset samples with direct links
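
To give a feel for what a word-level diff involves, here is a minimal sketch based on a longest-common-subsequence table. It is an illustration only: `wordDiff` is a hypothetical name, and the app may compute its diffs differently (for example with a diff library).

```javascript
// Illustrative word-level diff; not necessarily how the app computes diffs.
function wordDiff(original, improved) {
    const a = original.split(/\s+/).filter(Boolean);
    const b = improved.split(/\s+/).filter(Boolean);

    // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
    const lcs = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
    for (let i = a.length - 1; i >= 0; i--) {
        for (let j = b.length - 1; j >= 0; j--) {
            lcs[i][j] = a[i] === b[j]
                ? lcs[i + 1][j + 1] + 1
                : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
        }
    }

    // Walk the table from the start and emit equal / removed / added operations
    const ops = [];
    let i = 0;
    let j = 0;
    while (i < a.length && j < b.length) {
        if (a[i] === b[j]) {
            ops.push({ type: 'equal', word: a[i] });
            i++;
            j++;
        } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
            ops.push({ type: 'removed', word: a[i] });
            i++;
        } else {
            ops.push({ type: 'added', word: b[j] });
            j++;
        }
    }
    while (i < a.length) ops.push({ type: 'removed', word: a[i++] });
    while (j < b.length) ops.push({ type: 'added', word: b[j++] });
    return ops;
}

console.log(wordDiff('the qiuck fox', 'the quick fox').map((op) => `${op.type}:${op.word}`));
// → [ 'equal:the', 'removed:qiuck', 'added:quick', 'equal:fox' ]
```

A renderer could then wrap `removed` and `added` words in differently colored spans. The same approach works at character or line level by changing how the input is split.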

## Quick Start

### Option 1: Local Development

1. Clone or download this directory
2. Serve the files using any static web server:

```bash
# Using Python
python -m http.server 8000

# Using Node.js
npx serve .

# Using PHP
php -S localhost:8000
```

3. Open http://localhost:8000 in your browser

### Option 2: GitHub Pages

1. Push this directory to a GitHub repository
2. Enable GitHub Pages in repository settings
3. Access via `https://[username].github.io/[repo-name]/`

### Option 3: Direct File Access

Simply open `index.html` in a modern web browser. Note: Some features may be limited due to CORS restrictions.

## Usage

### Loading a Dataset

1. Enter a HuggingFace dataset ID (e.g., `davanstrien/exams-ocr`)
2. Click "Load" or press Enter
3. The explorer will automatically detect text columns

### Navigation

- **Next**: Press `J` or `→` arrow key
- **Previous**: Press `K` or `←` arrow key
- **Switch Views**: Press `1` (comparison), `2` (diff), or `3` (improved only)
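
The shortcuts above can be pictured as a lookup from `KeyboardEvent.key` values to actions. The sketch below is an assumption about how such a mapping might look, not a copy of `js/app.js`; `KEY_ACTIONS`, `actionForKey`, and the action names are hypothetical.

```javascript
// Hypothetical mapping from KeyboardEvent.key values to action names.
const KEY_ACTIONS = new Map([
    ['j', 'next'],
    ['ArrowRight', 'next'],
    ['k', 'previous'],
    ['ArrowLeft', 'previous'],
    ['1', 'view:comparison'],
    ['2', 'view:diff'],
    ['3', 'view:improved'],
]);

function actionForKey(key) {
    // Lowercase single characters so that both "J" and "j" match
    const normalised = key.length === 1 ? key.toLowerCase() : key;
    return KEY_ACTIONS.get(normalised) ?? null;
}

// Browser wiring (assumed; `dispatch` is a placeholder for the app's own handler):
// document.addEventListener('keydown', (event) => {
//     const action = actionForKey(event.key);
//     if (action) dispatch(action);
// });
```

Keeping the mapping in one table makes it easy to see every binding at a glance and to add new ones without touching the event handler.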

### Supported Column Names

The explorer automatically detects these column patterns:

**Original OCR**: `text`, `ocr`, `original_text`, `ground_truth`  
**Improved OCR**: `markdown`, `new_ocr`, `corrected_text`, `vlm_ocr`
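
As a rough sketch of how this detection could work, the function below checks each column name against the two lists above. It assumes exact name matches and that the first matching column wins; the real `detectColumns` method in `js/dataset-api.js` may behave differently, and `detectTextColumns` is a hypothetical name.

```javascript
// Pattern lists copied from this README; the function itself is illustrative.
const ORIGINAL_PATTERNS = ['text', 'ocr', 'original_text', 'ground_truth'];
const IMPROVED_PATTERNS = ['markdown', 'new_ocr', 'corrected_text', 'vlm_ocr'];

function detectTextColumns(columnNames) {
    let originalTextColumn = null;
    let improvedTextColumn = null;
    for (const name of columnNames) {
        // Assumption: exact match, first matching column wins
        if (!originalTextColumn && ORIGINAL_PATTERNS.includes(name)) {
            originalTextColumn = name;
        } else if (!improvedTextColumn && IMPROVED_PATTERNS.includes(name)) {
            improvedTextColumn = name;
        }
    }
    return { originalTextColumn, improvedTextColumn };
}

console.log(detectTextColumns(['image', 'text', 'markdown']));
// → { originalTextColumn: 'text', improvedTextColumn: 'markdown' }
```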

## Technical Details

### Architecture

```
┌─────────────────┐     ┌──────────────────────┐
│   Browser UI    │────▶│ HF Dataset Viewer API│
│  (Alpine.js)    │     │ (datasets-server)    │
└─────────────────┘     └──────────────────────┘
        │
        ▼
┌─────────────────┐
│  Local Cache    │
│  (JavaScript)   │
└─────────────────┘
```

### API Integration

Uses the HuggingFace Dataset Viewer API:
- Base URL: `https://datasets-server.huggingface.co`
- No authentication required for public datasets
- Automatic handling of image URL expiration
- Batched requests (up to 100 rows each) for efficient data loading
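
For orientation, a request to the API's `/rows` endpoint might be built as in the sketch below. `buildRowsUrl` is a hypothetical helper rather than the app's actual code, and the `default` config and `train` split in the usage comment are assumptions that depend on the dataset.

```javascript
// Sketch only: buildRowsUrl is a hypothetical helper, not the app's actual code.
const API_BASE = 'https://datasets-server.huggingface.co';
const MAX_ROWS_PER_REQUEST = 100; // per-request limit noted in this README

function buildRowsUrl(dataset, config, split, offset = 0, length = MAX_ROWS_PER_REQUEST) {
    const params = new URLSearchParams({
        dataset,
        config,
        split,
        offset: String(offset),
        // Clamp to the per-request maximum
        length: String(Math.min(length, MAX_ROWS_PER_REQUEST)),
    });
    return `${API_BASE}/rows?${params}`;
}

// Usage in a browser or Node 18+ (config and split names are assumptions):
// const response = await fetch(buildRowsUrl('davanstrien/exams-ocr', 'default', 'train'));
// const data = await response.json(); // expected to contain a `rows` array
```

Using `URLSearchParams` takes care of escaping the `/` in dataset IDs.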

### Performance Optimizations

- **Batch Loading**: Fetches 100 rows at a time
- **Smart Caching**: Keeps fetched rows in a local JavaScript cache to avoid repeat API calls
- **Lazy Loading**: Only loads visible content
- **URL Refresh**: Automatically refreshes expired image URLs
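
Batch loading and caching can be combined in a small class like the one sketched below. This is an assumption about the general technique, not the app's implementation: `RowCache` and `fetchBatch` are hypothetical names, and the fetch function is injected so the cache logic stays independent of the network.

```javascript
// Sketch of batch loading with an in-memory cache; names are hypothetical.
const BATCH_SIZE = 100;

class RowCache {
    constructor(fetchBatch) {
        this.fetchBatch = fetchBatch; // async (offset, length) => array of rows
        this.batches = new Map();     // batch start offset -> Promise of rows
    }

    async getRow(index) {
        // Round down to the start of the batch that contains this row
        const offset = Math.floor(index / BATCH_SIZE) * BATCH_SIZE;
        if (!this.batches.has(offset)) {
            // Store the promise so concurrent callers share a single request
            this.batches.set(offset, this.fetchBatch(offset, BATCH_SIZE));
        }
        const rows = await this.batches.get(offset);
        return rows[index - offset];
    }
}

// Usage (loadRows is a placeholder for a function that calls the API):
// const cache = new RowCache((offset, length) => loadRows(offset, length));
// const row = await cache.getRow(42); // fetches rows 0-99 once, then serves from cache
```

Note that this sketch also caches failed requests; a fuller version would remove a batch from the map when its promise rejects so that it can be retried.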

## Customization

### Adding New Column Patterns

Edit `js/dataset-api.js` and update the `detectColumns` method:

```javascript
// `name` is the column name being checked; replace 'your_column_name' with your own
if (!originalTextColumn && ['your_column_name'].includes(name)) {
    originalTextColumn = name;
}
```

### Styling

The UI uses Tailwind CSS. Modify styles in:
- `css/styles.css` for custom styles
- Tailwind classes directly in `index.html`

### Keyboard Shortcuts

Add new shortcuts in `js/app.js`:

```javascript
case 'your_key':
    // Your action here
    break;
```

## Browser Support

- Chrome/Edge: Full support
- Firefox: Full support
- Safari: Full support (14+)
- Mobile browsers: Full support with touch navigation

## Limitations

- Maximum 100 rows per API request
- Image URLs expire after ~1 hour
- No authentication support for private datasets (yet)
- Read-only interface (no editing capabilities)

## Future Enhancements

- [ ] Export functionality for improved texts
- [ ] Batch processing capabilities
- [ ] Search within dataset
- [ ] Bookmarking system
- [ ] Authentication for private datasets
- [ ] Confidence scores visualization
- [ ] Multi-dataset comparison

## Troubleshooting

### "Dataset viewer is not available"
- Check if the dataset exists on HuggingFace
- Ensure the dataset has viewer enabled
- Try a known working dataset like `davanstrien/exams-ocr`
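
One way to check viewer availability by hand is the API's `/is-valid` endpoint. The sketch below builds the URL and interprets the response; the helper names are hypothetical, and the `viewer` and `preview` response fields are stated from memory of the Dataset Viewer docs, so verify them there before relying on this.

```javascript
// Sketch only: helper names are hypothetical; response fields are assumed.
function buildIsValidUrl(dataset) {
    const params = new URLSearchParams({ dataset });
    return `https://datasets-server.huggingface.co/is-valid?${params}`;
}

function describeAvailability(status) {
    // `status` is the parsed JSON body of the /is-valid response
    if (status.viewer) return 'full viewer available';
    if (status.preview) return 'preview only';
    return 'viewer not available';
}

// Usage in a browser console or Node 18+:
// const response = await fetch(buildIsValidUrl('davanstrien/exams-ocr'));
// console.log(describeAvailability(await response.json()));
```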

### Images not loading
- Image URLs expire after ~1 hour
- The app automatically refreshes URLs on error
- Check browser console for detailed errors

### Slow loading
- Large datasets may take time for initial load
- Consider using datasets with pre-computed statistics
- Check your internet connection

## Contributing

This is a standalone tool designed for OCR exploration. Feel free to fork and customize for your needs!

## License

MIT License - Use freely for any purpose

## Related Projects

- [OCR Time Machine](../app.py) - Interactive OCR improvement with VLMs
- [HuggingFace Datasets](https://huggingface.co/datasets) - Browse available datasets
- [Dataset Viewer Docs](https://huggingface.co/docs/dataset-viewer) - API documentation