---
title: OCR Time Capsule
emoji: 📦
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
---

# OCR Time Capsule 📦

A fast, modern web interface for exploring and comparing OCR text improvements in HuggingFace datasets. Browse through pre-processed OCR improvements to see how AI models enhance historical document transcriptions.

![OCR Time Capsule](https://img.shields.io/badge/OCR-Time%20Capsule-blue)

## Features

- **Fast Navigation**: Browse through large OCR datasets with keyboard shortcuts (J/K or arrow keys)
- **Side-by-Side Comparison**: View original OCR and improved text simultaneously
- **Advanced Diff Visualization**: Character, word, or line-level differences with color highlighting
- **No Backend Required**: Direct integration with HuggingFace Dataset Viewer API
- **Responsive Design**: Works seamlessly on desktop and mobile devices
- **Dark Mode**: Easy on the eyes for extended reading sessions
- **URL Sharing**: Share specific dataset samples with direct links
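
To illustrate what a word-level diff involves, here is a minimal sketch based on a longest-common-subsequence (LCS) table. It is not the app's actual diff code, which may use a library or a different algorithm; the function name `diffWords` and the `{ op, text }` output shape are assumptions made for this example.

```javascript
// Word-level diff via a longest-common-subsequence (LCS) table.
// Returns an array of { op, text }, where op is 'equal', 'delete', or 'insert'.
// `diffWords` is a hypothetical name, not a function from this project.
function diffWords(original, improved) {
  const a = original.split(/\s+/).filter(Boolean);
  const b = improved.split(/\s+/).filter(Boolean);

  // lcs[i][j] = length of the LCS of a[i..] and b[j..]
  const lcs = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0)
  );
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] = a[i] === b[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  // Walk both word lists, emitting the operation that keeps the LCS intact.
  const ops = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      ops.push({ op: 'equal', text: a[i] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      ops.push({ op: 'delete', text: a[i] });
      i++;
    } else {
      ops.push({ op: 'insert', text: b[j] });
      j++;
    }
  }
  while (i < a.length) ops.push({ op: 'delete', text: a[i++] });
  while (j < b.length) ops.push({ op: 'insert', text: b[j++] });
  return ops;
}
```

A renderer could then wrap `delete` entries in a red highlight and `insert` entries in a green one. The table costs time and memory proportional to the product of the two word counts, which is acceptable for page-sized texts but not for very long documents.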

## Quick Start

### Option 1: Local Development

1. Clone or download this directory
2. Serve the files using any static web server:

```bash
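# Using Python 3 (use `python3` if `python` points to Python 2 or is missing)
python -m http.server 8000

# Using Node.js (serve defaults to port 3000, so set the port explicitly)
npx serve -l 8000 .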

# Using PHP
php -S localhost:8000
```

3. Open http://localhost:8000 in your browser

### Option 2: GitHub Pages

1. Push this directory to a GitHub repository
2. Enable GitHub Pages in repository settings
3. Access via `https://[username].github.io/[repo-name]/`

### Option 3: Direct File Access

Open `index.html` directly in a modern web browser. Note: pages loaded from `file://` may have some features limited by browser CORS restrictions, so serving the files locally (Option 1) is more reliable.

## Usage

### Loading a Dataset

1. Enter a HuggingFace dataset ID (e.g., `davanstrien/exams-ocr`)
2. Click "Load" or press Enter
3. The explorer will automatically detect text columns

### Navigation

- **Next**: Press `J` or `→` arrow key
- **Previous**: Press `K` or `←` arrow key
- **Switch Views**: Press `1` (comparison), `2` (diff), or `3` (improved only)
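
The bindings above can be expressed as a small lookup table. The sketch below is an illustration only: `KEY_ACTIONS`, `actionForKey`, and the action names are hypothetical, and ignoring modifier-key combinations is an assumption rather than documented behavior of `js/app.js`.

```javascript
// Hypothetical key-to-action table mirroring the shortcuts listed above.
const KEY_ACTIONS = new Map([
  ['j', 'next'],
  ['ArrowRight', 'next'],
  ['k', 'previous'],
  ['ArrowLeft', 'previous'],
  ['1', 'view:comparison'],
  ['2', 'view:diff'],
  ['3', 'view:improved'],
]);

// Takes a KeyboardEvent-like object and returns an action name,
// or null when the key is not bound.
function actionForKey(event) {
  // Assumption: leave browser shortcuts such as Ctrl+J alone.
  if (event.ctrlKey || event.metaKey || event.altKey) return null;
  // Normalize single letters so that 'J' and 'j' behave the same.
  const key = event.key.length === 1 ? event.key.toLowerCase() : event.key;
  return KEY_ACTIONS.get(key) ?? null;
}
```

In a browser this would typically be called from a `keydown` listener, which should also skip events that originate in the dataset ID input so that typing a `j` there does not change the sample.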

### Supported Column Names

The explorer automatically detects these column patterns:

**Original OCR**: `text`, `ocr`, `original_text`, `ground_truth`  
**Improved OCR**: `markdown`, `new_ocr`, `corrected_text`, `vlm_ocr`
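
As a rough sketch of how this detection could work, the function below picks the first listed name that is present among a dataset's columns. The name `detectTextColumns`, the exact-match rule, and the use of list order as priority are assumptions for illustration; the real `detectColumns` method in `js/dataset-api.js` may behave differently.

```javascript
// Candidate column names from the lists above.
// Treating list order as priority is an assumption.
const ORIGINAL_PATTERNS = ['text', 'ocr', 'original_text', 'ground_truth'];
const IMPROVED_PATTERNS = ['markdown', 'new_ocr', 'corrected_text', 'vlm_ocr'];

// Hypothetical helper: takes an array of column names and returns the
// detected original and improved text columns, or null when none match.
function detectTextColumns(columnNames) {
  const pick = (patterns) =>
    patterns.find((pattern) => columnNames.includes(pattern)) ?? null;
  return {
    original: pick(ORIGINAL_PATTERNS),
    improved: pick(IMPROVED_PATTERNS),
  };
}
```

For a dataset with the columns `image`, `text`, and `markdown`, this returns `text` as the original column and `markdown` as the improved one.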

## Technical Details

### Architecture

```
┌─────────────────┐     ┌──────────────────────┐
│   Browser UI    │────▶│ HF Dataset Viewer API│
│  (Alpine.js)    │     │ (datasets-server)    │
└─────────────────┘     └──────────────────────┘
        │
        ▼
┌─────────────────┐
│  Local Cache    │
│  (JavaScript)   │
└─────────────────┘
```

### API Integration

Uses the HuggingFace Dataset Viewer API:
- Base URL: `https://datasets-server.huggingface.co`
- No authentication required for public datasets
- Automatic handling of image URL expiration
- Batched row requests (up to 100 rows per call) for efficient data loading
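
For reference, a request to the API's `/rows` endpoint can be assembled as shown below. This is a sketch rather than the app's own code: `buildRowsUrl` is a hypothetical helper, and the `default` config and `train` split are placeholder defaults that many datasets will not match.

```javascript
const BASE_URL = 'https://datasets-server.huggingface.co';

// Hypothetical helper that builds a /rows request URL.
// The config and split defaults are assumptions; real datasets vary.
function buildRowsUrl({ dataset, config = 'default', split = 'train', offset = 0, length = 100 }) {
  if (!Number.isInteger(length) || length < 1 || length > 100) {
    throw new RangeError('length must be an integer between 1 and 100');
  }
  const url = new URL('/rows', BASE_URL);
  url.searchParams.set('dataset', dataset);
  url.searchParams.set('config', config);
  url.searchParams.set('split', split);
  url.searchParams.set('offset', String(offset));
  url.searchParams.set('length', String(length));
  return url.toString();
}

// Example (not executed here):
//   const response = await fetch(buildRowsUrl({ dataset: 'davanstrien/exams-ocr' }));
//   const data = await response.json(); // rows are expected under data.rows
```

The available config and split names for a dataset can be looked up through the API's `/splits` endpoint before requesting rows.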

### Performance Optimizations

- **Batch Loading**: Fetches 100 rows at a time
- **Smart Caching**: Keeps fetched batches in a local JavaScript cache to reduce API calls
- **Lazy Loading**: Only loads visible content
- **URL Refresh**: Automatically refreshes expired image URLs
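
Batch loading and caching can be combined as in the sketch below, which is one possible approach and not necessarily how the app does it. `BatchCache` and `batchOffsetFor` are hypothetical names, and `fetchBatch` stands for any caller-supplied function that returns a promise for an array of rows.

```javascript
const BATCH_SIZE = 100; // matches the batch size stated above

// Offset of the batch that contains a given row index.
function batchOffsetFor(rowIndex) {
  return Math.floor(rowIndex / BATCH_SIZE) * BATCH_SIZE;
}

// Hypothetical in-memory cache keyed by batch offset.
class BatchCache {
  // fetchBatch(offset, length) is supplied by the caller and must
  // return a promise for an array of rows.
  constructor(fetchBatch) {
    this.fetchBatch = fetchBatch;
    this.batches = new Map(); // offset -> promise of rows
  }

  async getRow(rowIndex) {
    const offset = batchOffsetFor(rowIndex);
    if (!this.batches.has(offset)) {
      // Cache the promise so concurrent callers share one request.
      const pending = Promise.resolve(this.fetchBatch(offset, BATCH_SIZE)).catch((err) => {
        this.batches.delete(offset); // allow a retry after a failure
        throw err;
      });
      this.batches.set(offset, pending);
    }
    const rows = await this.batches.get(offset);
    return rows[rowIndex - offset];
  }
}
```

Storing the promise rather than the resolved rows means that pressing `J` several times in quick succession triggers at most one request per batch.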

## Customization

### Adding New Column Patterns

Edit `js/dataset-api.js` and update the `detectColumns` method:

```javascript
if (!originalTextColumn && ['your_column_name'].includes(name)) {
    originalTextColumn = name;
}
```

### Styling

The UI uses Tailwind CSS. Modify styles in:
- `css/styles.css` for custom styles
- Tailwind classes directly in `index.html`

### Keyboard Shortcuts

Add new shortcuts in `js/app.js`:

```javascript
case 'your_key':
    // Your action here
    break;
```

## Browser Support

- Chrome/Edge: Full support
- Firefox: Full support
- Safari: Full support (14+)
- Mobile browsers: Full support with touch navigation

## Limitations

- Maximum 100 rows per API request
- Image URLs expire after ~1 hour
- No authentication support for private datasets (yet)
- Read-only interface (no editing capabilities)

## Future Enhancements

- [ ] Export functionality for improved texts
- [ ] Batch processing capabilities
- [ ] Search within dataset
- [ ] Bookmarking system
- [ ] Authentication for private datasets
- [ ] Confidence scores visualization
- [ ] Multi-dataset comparison

## Troubleshooting

### "Dataset viewer is not available"
- Check if the dataset exists on HuggingFace
- Ensure the dataset has the viewer enabled
- Try a known working dataset like `davanstrien/exams-ocr`
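
The checks above can also be done programmatically against the API's `/is-valid` endpoint. The sketch below assumes the response is a JSON object with boolean fields such as `viewer` and `preview`; confirm the current field names in the Dataset Viewer documentation. `buildIsValidUrl` and `describeViewerStatus` are hypothetical helpers.

```javascript
// Hypothetical helper that builds an /is-valid request URL.
function buildIsValidUrl(dataset) {
  const url = new URL('/is-valid', 'https://datasets-server.huggingface.co');
  url.searchParams.set('dataset', dataset);
  return url.toString();
}

// Interprets an /is-valid response body.
// Assumption: the body has boolean `viewer` and `preview` fields.
function describeViewerStatus(status) {
  if (status.viewer) return 'viewer available';
  if (status.preview) return 'preview only';
  return 'viewer not available';
}

// Example (not executed here):
//   const response = await fetch(buildIsValidUrl('davanstrien/exams-ocr'));
//   console.log(describeViewerStatus(await response.json()));
```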

### Images not loading
- Image URLs expire after ~1 hour
- The app automatically refreshes URLs on error
- Check browser console for detailed errors
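
When customizing the app, one complementary option to refreshing on error is to check a URL's age before using it. The sketch below assumes the app records when each batch was fetched; `isImageUrlStale` and the five-minute safety margin are hypothetical, and the one-hour lifetime is only the approximate figure given above.

```javascript
const IMAGE_URL_TTL_MS = 60 * 60 * 1000; // ~1 hour, the approximate lifetime noted above
const SAFETY_MARGIN_MS = 5 * 60 * 1000; // hypothetical margin, refresh a little early

// Returns true when a URL fetched at `fetchedAtMs` should be refreshed.
// `nowMs` is a parameter so the check can be exercised with fixed times.
function isImageUrlStale(fetchedAtMs, nowMs = Date.now()) {
  return nowMs - fetchedAtMs >= IMAGE_URL_TTL_MS - SAFETY_MARGIN_MS;
}
```

Because the real lifetime is approximate, this check reduces failed image loads but does not replace the on-error refresh.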

### Slow loading
- Large datasets may take time for initial load
- Consider using datasets with pre-computed statistics
- Check your internet connection

## Contributing

This is a standalone tool designed for OCR exploration. Feel free to fork and customize for your needs!

## License

MIT License - Use freely for any purpose

## Related Projects

- [OCR Time Machine](../app.py) - Interactive OCR improvement with VLMs
- [HuggingFace Datasets](https://huggingface.co/datasets) - Browse available datasets
- [Dataset Viewer Docs](https://huggingface.co/docs/dataset-viewer) - API documentation