Spaces: davanstrien/ocr-time-capsule

davanstrien HF Staff committed 3 days ago

Commit c49cb47 (1 parent: b9aef14)

Configure OCR Time Capsule with default dataset and branding

Files changed (5)

CLAUDE.md +186 -0
css/styles.css +197 -0
js/app.js +550 -0
js/dataset-api.js +273 -0
js/diff-utils.js +219 -0

CLAUDE.md ADDED

	@@ -0,0 +1,186 @@

+# CLAUDE.md
+This file provides guidance to Claude Code (claude.ai/code) when working with the OCR Text Explorer.
+## Project Overview
+OCR Text Explorer is a modern, standalone web application for browsing and comparing OCR text improvements in HuggingFace datasets. Built as a lightweight alternative to the Gradio-based OCR Time Machine, it focuses specifically on exploring pre-OCR'd datasets with enhanced user experience.
+## Architecture
+### Technology Stack
+- **Frontend Framework**: Alpine.js (lightweight reactivity, ~15KB)
+- **Styling**: Tailwind CSS (utility-first, responsive design)
+- **Interactions**: HTMX (server-side rendering capabilities)
+- **API**: HuggingFace Dataset Viewer API (no backend required)
+- **Language**: Vanilla JavaScript (no build process needed)
+### Core Components
+**index.html** - Main application shell
+- Split-pane layout (1/3 image, 2/3 text comparison)
+- Three view modes: Side-by-side, Inline diff, Improved only
+- Dark mode support with proper contrast
+- Responsive design for mobile devices
+**js/dataset-api.js** - HuggingFace API wrapper
+- Smart caching with 45-minute expiration for signed URLs
+- Batch loading (100 rows at a time)
+- Automatic column detection for different dataset schemas
+- Image URL refresh on expiration
+**js/app.js** - Alpine.js application logic
+- Keyboard navigation (J/K, arrows)
+- URL state management for shareable links
+- Diff mode switching (character/word/line)
+- Dark mode persistence in localStorage
+**js/diff-utils.js** - Text comparison algorithms
+- Character-level diff with inline highlighting
+- Word-level diff preserving whitespace
+- Line-level diff for larger changes
+- LCS (Longest Common Subsequence) implementation
+**css/styles.css** - Custom styling
+- Dark mode enhancements
+- Diff highlighting with accessibility in mind
+- Smooth transitions and animations
+- Print-friendly styles
+## Key Design Decisions
+### Why Separate from OCR Time Machine?
+1. **Focused Purpose**: OCR Time Machine is for live OCR processing with VLMs (requires GPU), while this explorer is for browsing pre-processed results
+2. **Performance**: No Python/Gradio overhead - instant loading and navigation
+3. **User Experience**: Custom UI optimized for text comparison workflows
+4. **Deployment**: Static files can be hosted anywhere (GitHub Pages, CDN, etc.)
+### API vs Backend Trade-offs
+**Chose HF Dataset Viewer API because:**
+- No backend infrastructure needed
+- Automatic image serving with CDN
+- Built-in pagination support
+- Works with any public HF dataset
+**Limitations accepted:**
+- Image URLs expire (~1 hour)
+- 100 rows max per request
+- No write capabilities
+- Public datasets only (no auth yet)
+### UI/UX Principles
+1. **Keyboard-first**: Professional users prefer keyboard navigation
+2. **Information density**: Show more content, less chrome
+3. **Visual diff**: Color-coded changes are easier to scan than plain side-by-side text
+4. **Dark mode**: Essential for extended reading sessions
+5. **Responsive**: Works on tablets for field work
+## Development Approach
+### Phase 1: MVP (Completed)
+- Basic dataset loading and navigation
+- Side-by-side text comparison
+- Keyboard shortcuts
+- Dark mode
+### Phase 2: Enhancements (Completed)
+- Three diff algorithms (char/word/line)
+- URL state management
+- Image error handling with refresh
+- Responsive mobile layout
+### Phase 3: Polish (Completed)
+- Fixed dark mode contrast issues
+- Optimized performance with direct indexing
+- Added loading states and error handling
+- Comprehensive documentation
+## Common Tasks
+### Adding Column Name Patterns
+```javascript
+// In dataset-api.js detectColumns() method
+if (!originalTextColumn && ['your_column_name'].includes(name)) {
+    originalTextColumn = name;
+}
+```
+### Adding Keyboard Shortcuts
+```javascript
+// In app.js setupKeyboardNavigation()
+case 'your_key':
+    // Your action
+    break;
+```
+### Customizing Diff Colors
+```javascript
+// In diff-utils.js
+// Light mode: bg-red-200, text-red-800
+// Dark mode: bg-red-950, text-red-300
+```
+## Performance Optimizations
+1. **Direct Row Indexing**: Requests rows by index through `getRow()` instead of holding the whole dataset in memory
+2. **Smart Caching**: Caches API responses for 45 minutes (conservative for signed URLs)
+3. **Batch Fetching**: Loads 100 rows at once, caches for smooth navigation
+4. **Lazy Loading**: Only fetches data when needed
+## Known Issues & Solutions
+### Issue: Navigation buttons were disabled
+**Cause**: API response structure wasn't parsed correctly
+**Fix**: Updated getTotalRows() to check `size.config.num_rows` and `size.splits[0].num_rows`
+### Issue: Dark mode text unreadable
+**Cause**: Insufficient contrast in diff highlighting and code blocks
+**Fix**:
+- Changed diff colors to use `dark:bg-red-950` and `dark:text-red-300`
+- Added explicit `text-gray-900 dark:text-gray-100` to all text containers
+### Issue: Image loading errors
+**Cause**: Signed URLs expire after ~1 hour
+**Fix**: Implemented handleImageError() with automatic URL refresh
+## Future Enhancements
+- [ ] Search/filter within dataset
+- [ ] Bookmark favorite samples
+- [ ] Export selected texts
+- [ ] Support for private datasets (auth)
+- [ ] Metrics display (CER/WER)
+- [ ] Batch operations
+- [ ] PWA support for offline viewing
+## Deployment
+### Static Hosting (Recommended)
+```bash
+# Any static file server works
+python3 -m http.server 8000
+npx serve .
+```
+### GitHub Pages
+1. Push to GitHub repository
+2. Enable Pages in settings
+3. Access at: `https://[username].github.io/[repo]/ocr-text-explorer/`
+### CDN Deployment
+- Upload files to any CDN
+- No server-side processing needed
+- Works with CloudFlare, Netlify, Vercel, etc.
+## Testing Datasets
+Known working datasets:
+- `davanstrien/exams-ocr` - Default dataset with great examples
+- Any dataset with image + text columns
+Column patterns automatically detected:
+- Original: `text`, `ocr`, `original_text`, `ground_truth`
+- Improved: `markdown`, `new_ocr`, `corrected_text`, `vlm_ocr`
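
Review note: the column patterns listed at the end of CLAUDE.md can be illustrated with a small sketch. This is not the committed `detectColumns()` (its body is not visible in this part of the diff). The function name `detectColumnsSketch`, the two pattern constants, and the "object with a `src` field" image heuristic are assumptions made for illustration; only the pattern names and the `image` / `originalText` / `improvedText` result keys come from the diff.

```javascript
// Hypothetical sketch, not the committed implementation.
// Pattern lists copied from the "Column patterns automatically detected" section.
const ORIGINAL_PATTERNS = ['text', 'ocr', 'original_text', 'ground_truth'];
const IMPROVED_PATTERNS = ['markdown', 'new_ocr', 'corrected_text', 'vlm_ocr'];

function detectColumnsSketch(row) {
    const result = { image: null, originalText: null, improvedText: null };
    for (const [name, value] of Object.entries(row)) {
        // Assumed heuristic: an image cell is an object that carries a `src` URL.
        if (!result.image && value !== null && typeof value === 'object' && 'src' in value) {
            result.image = name;
        } else if (!result.originalText && ORIGINAL_PATTERNS.includes(name)) {
            result.originalText = name;
        } else if (!result.improvedText && IMPROVED_PATTERNS.includes(name)) {
            result.improvedText = name;
        }
    }
    return result;
}

// Example row shaped like the cells app.js reads (src, width, height).
const columns = detectColumnsSketch({
    image: { src: 'https://example.invalid/page.jpg', width: 400, height: 300 },
    text: 'Tbe quick brown fox',
    markdown: 'The quick brown fox',
});
console.log(columns);
```

The first matching column wins for each role, so a dataset that has both `text` and `ocr` would resolve to whichever key appears first in the row. A real implementation may prefer an explicit priority order instead.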

css/styles.css ADDED

	@@ -0,0 +1,197 @@

+/**
+ * Custom styles for OCR Text Explorer
+ * Extends Tailwind CSS with specific styling needs
+ */
+/* Custom scrollbar styling */
+::-webkit-scrollbar {
+    width: 8px;
+    height: 8px;
+}
+::-webkit-scrollbar-track {
+    @apply bg-gray-100 dark:bg-gray-800;
+}
+::-webkit-scrollbar-thumb {
+    @apply bg-gray-400 dark:bg-gray-600 rounded;
+}
+::-webkit-scrollbar-thumb:hover {
+    @apply bg-gray-500 dark:bg-gray-500;
+}
+/* Firefox scrollbar */
+* {
+    scrollbar-width: thin;
+    scrollbar-color: theme('colors.gray.400') theme('colors.gray.100');
+}
+.dark * {
+    scrollbar-color: theme('colors.gray.600') theme('colors.gray.800');
+}
+/* Smooth transitions for theme switching */
+body {
+    transition: background-color 0.3s ease, color 0.3s ease;
+}
+/* Image panel sticky positioning adjustment */
+.sticky {
+    position: -webkit-sticky;
+    position: sticky;
+}
+/* Diff content styling */
+.diff-content {
+    line-height: 1.6;
+    word-break: break-word;
+}
+/* Keyboard hint styling */
+kbd {
+    @apply inline-block px-2 py-1 text-xs font-semibold text-gray-800 bg-gray-100 border border-gray-300 rounded dark:bg-gray-700 dark:text-gray-200 dark:border-gray-600;
+    box-shadow: 0 1px 0 rgba(0, 0, 0, 0.1);
+}
+/* Loading spinner animation (in case Tailwind's animate-spin needs adjustment) */
+@keyframes spin {
+    to {
+        transform: rotate(360deg);
+    }
+}
+.animate-spin {
+    animation: spin 1s linear infinite;
+}
+/* Tab hover effect */
+nav button {
+    position: relative;
+    transition: color 0.2s ease;
+}
+nav button::after {
+    content: '';
+    position: absolute;
+    bottom: -2px;
+    left: 0;
+    right: 0;
+    height: 2px;
+    background-color: transparent;
+    transition: background-color 0.2s ease;
+}
+nav button:hover::after {
+    @apply bg-gray-300 dark:bg-gray-600;
+}
+/* Image loading state */
+img {
+    @apply bg-gray-200 dark:bg-gray-700;
+    min-height: 200px;
+}
+img[src=""] {
+    visibility: hidden;
+}
+/* Print styles */
+@media print {
+    header, footer {
+        display: none !important;
+    }
+    .no-print {
+        display: none !important;
+    }
+    main {
+        height: auto !important;
+    }
+    .diff-content {
+        page-break-inside: avoid;
+    }
+}
+/* Responsive adjustments */
+@media (max-width: 768px) {
+    /* Stack panels vertically on mobile */
+    main.flex {
+        @apply flex-col;
+    }
+    /* Full width for panels on mobile */
+    main > div:first-child {
+        @apply w-full max-h-96;
+    }
+    /* Adjust text size */
+    .prose-sm {
+        @apply text-xs;
+    }
+    /* Hide keyboard hints on mobile */
+    footer .text-sm:last-child {
+        @apply hidden;
+    }
+}
+/* Focus styles for accessibility */
+button:focus, input:focus, select:focus {
+    @apply outline-none ring-2 ring-blue-500 ring-offset-2 dark:ring-offset-gray-900;
+}
+/* Custom tooltip styles (if needed later) */
+.tooltip {
+    @apply invisible absolute z-10 px-2 py-1 text-xs text-white bg-gray-900 rounded shadow-lg dark:bg-gray-700;
+}
+.tooltip-trigger:hover .tooltip {
+    @apply visible;
+}
+/* Preserve whitespace in diff views */
+.whitespace-pre-wrap {
+    white-space: pre-wrap;
+    word-wrap: break-word;
+}
+/* Enhanced diff highlighting with better dark mode contrast */
+.diff-delete {
+    @apply bg-red-200 dark:bg-red-950 text-red-800 dark:text-red-300;
+    text-decoration: line-through;
+    text-decoration-color: currentColor;
+    text-decoration-thickness: 2px;
+}
+.diff-insert {
+    @apply bg-green-200 dark:bg-green-950 text-green-800 dark:text-green-300;
+    position: relative;
+}
+/* Dark mode specific improvements */
+.dark .prose {
+    @apply text-gray-200;
+}
+.dark .prose h3 {
+    @apply text-gray-100;
+}
+/* Remove this - handled inline with classes
+.dark pre {
+    @apply bg-gray-800 text-gray-200;
+} */
+/* Line numbers for future enhancement */
+.line-numbers {
+    counter-reset: line;
+}
+.line-numbers > div::before {
+    counter-increment: line;
+    content: counter(line);
+    @apply inline-block w-12 mr-4 text-right text-gray-400 dark:text-gray-600 select-none;
+}

js/app.js ADDED

	@@ -0,0 +1,550 @@

+/**
+ * Main Alpine.js application for OCR Text Explorer
+ */
+document.addEventListener('alpine:init', () => {
+    Alpine.data('ocrExplorer', () => ({
+        // Dataset state
+        datasetId: 'davanstrien/exams-ocr',
+        datasetConfig: 'default',
+        datasetSplit: 'train',
+        // Navigation state
+        currentIndex: 0,
+        totalSamples: null,
+        currentSample: null,
+        jumpToPage: '',
+        // UI state
+        loading: false,
+        error: null,
+        activeTab: 'comparison',
+        diffMode: 'char',
+        darkMode: false,
+        showAbout: false,
+        showFlowView: false,
+        showDock: false,
+        // Flow view state
+        flowItems: [],
+        flowStartIndex: 0,
+        flowVisibleCount: 7,
+        flowOffset: 0,
+        // Dock state
+        dockItems: [],
+        dockHideTimeout: null,
+        dockStartIndex: 0,
+        dockVisibleCount: 10,
+        // Computed diff HTML
+        diffHtml: '',
+        // Statistics
+        similarity: 0,
+        charStats: { total: 0, added: 0, removed: 0 },
+        wordStats: { original: 0, improved: 0 },
+        // API instance
+        api: null,
+        async init() {
+            // Initialize API
+            this.api = new DatasetAPI();
+            // Apply dark mode from localStorage
+            this.darkMode = localStorage.getItem('darkMode') === 'true';
+            this.$watch('darkMode', value => {
+                localStorage.setItem('darkMode', value);
+                document.documentElement.classList.toggle('dark', value);
+            });
+            document.documentElement.classList.toggle('dark', this.darkMode);
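+            // Register the diffMode/currentSample watchers on this component instance
+            this.initWatchers();
+            // Setup keyboard navigation
+            this.setupKeyboardNavigation();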
+            // Load initial dataset
+            await this.loadDataset();
+        },
+        setupKeyboardNavigation() {
+            document.addEventListener('keydown', (e) => {
+                // Ignore keystrokes while the user is typing in a form field
+                if (['INPUT', 'TEXTAREA', 'SELECT'].includes(e.target.tagName) || e.target.isContentEditable) return;
+                switch(e.key) {
+                    case 'ArrowLeft':
+                        e.preventDefault();
+                        if (e.shiftKey && this.showDock) {
+                            this.scrollDockLeft();
+                        } else {
+                            this.previousSample();
+                        }
+                        break;
+                    case 'ArrowRight':
+                        e.preventDefault();
+                        if (e.shiftKey && this.showDock) {
+                            this.scrollDockRight();
+                        } else {
+                            this.nextSample();
+                        }
+                        break;
+                    case 'k':
+                    case 'K':
+                        e.preventDefault();
+                        this.previousSample();
+                        break;
+                    case 'j':
+                    case 'J':
+                        e.preventDefault();
+                        this.nextSample();
+                        break;
+                    case '1':
+                        this.activeTab = 'comparison';
+                        break;
+                    case '2':
+                        this.activeTab = 'diff';
+                        break;
+                    case '3':
+                        this.activeTab = 'improved';
+                        break;
+                    case 'v':
+                    case 'V':
+                        // Toggle dock with V key
+                        if (this.showDock) {
+                            this.hideDockPreview();
+                        } else {
+                            this.showDockPreview();
+                        }
+                        break;
+                }
+            });
+        },
+        async loadDataset() {
+            this.loading = true;
+            this.error = null;
+            try {
+                // Validate dataset
+                await this.api.validateDataset(this.datasetId);
+                // Get dataset info
+                const info = await this.api.getDatasetInfo(this.datasetId);
+                this.datasetConfig = info.defaultConfig;
+                this.datasetSplit = info.defaultSplit;
+                // Get total rows
+                this.totalSamples = await this.api.getTotalRows(
+                    this.datasetId,
+                    this.datasetConfig,
+                    this.datasetSplit
+                );
+                // Load first sample
+                this.currentIndex = 0;
+                await this.loadSample(0);
+            } catch (error) {
+                this.error = error.message;
+            } finally {
+                this.loading = false;
+            }
+        },
+        async loadSample(index) {
+            try {
+                const data = await this.api.getRow(
+                    this.datasetId,
+                    this.datasetConfig,
+                    this.datasetSplit,
+                    index
+                );
+                this.currentSample = data.row;
+                this.currentIndex = index;
+                // Update diff when sample changes
+                this.updateDiff();
+                // Update URL without triggering navigation
+                const url = new URL(window.location);
+                url.searchParams.set('dataset', this.datasetId);
+                url.searchParams.set('index', index);
+                window.history.replaceState({}, '', url);
+            } catch (error) {
+                this.error = `Failed to load sample: ${error.message}`;
+            }
+        },
+        async nextSample() {
+            if (this.currentIndex < this.totalSamples - 1) {
+                await this.loadSample(this.currentIndex + 1);
+            }
+        },
+        async previousSample() {
+            if (this.currentIndex > 0) {
+                await this.loadSample(this.currentIndex - 1);
+            }
+        },
+        async jumpToSample() {
+            const pageNum = parseInt(this.jumpToPage, 10);
+            if (!isNaN(pageNum) && pageNum >= 1 && pageNum <= this.totalSamples) {
+                // Convert 1-based page number to 0-based index
+                await this.loadSample(pageNum - 1);
+                // Clear the input after jumping
+                this.jumpToPage = '';
+            } else {
+                // Show error or just reset
+                this.jumpToPage = '';
+            }
+        },
+        getOriginalText() {
+            if (!this.currentSample) return '';
+            const columns = this.api.detectColumns(null, this.currentSample);
+            return this.currentSample[columns.originalText] || 'No original text found';
+        },
+        getImprovedText() {
+            if (!this.currentSample) return '';
+            const columns = this.api.detectColumns(null, this.currentSample);
+            return this.currentSample[columns.improvedText] || 'No improved text found';
+        },
+        getImageData() {
+            if (!this.currentSample) return null;
+            const columns = this.api.detectColumns(null, this.currentSample);
+            return columns.image ? this.currentSample[columns.image] : null;
+        },
+        getImageSrc() {
+            const imageData = this.getImageData();
+            return imageData?.src || '';
+        },
+        getImageDimensions() {
+            const imageData = this.getImageData();
+            if (imageData?.width && imageData?.height) {
+                return `${imageData.width}×${imageData.height}`;
+            }
+            return null;
+        },
+        updateDiff() {
+            const original = this.getOriginalText();
+            const improved = this.getImprovedText();
+            // Calculate statistics
+            this.calculateStatistics(original, improved);
+            // Use diff utility based on mode
+            switch(this.diffMode) {
+                case 'char':
+                    this.diffHtml = createCharacterDiff(original, improved);
+                    break;
+                case 'word':
+                    this.diffHtml = createWordDiff(original, improved);
+                    break;
+                case 'line':
+                    this.diffHtml = createLineDiff(original, improved);
+                    break;
+            }
+        },
+        calculateStatistics(original, improved) {
+            // Calculate similarity
+            this.similarity = calculateSimilarity(original, improved);
+            // Character statistics
+            const charDiff = this.getCharacterDiffStats(original, improved);
+            this.charStats = charDiff;
+            // Word statistics
+            const originalWords = original.split(/\s+/).filter(w => w.length > 0);
+            const improvedWords = improved.split(/\s+/).filter(w => w.length > 0);
+            this.wordStats = {
+                original: originalWords.length,
+                improved: improvedWords.length
+            };
+        },
+        getCharacterDiffStats(original, improved) {
+            const dp = computeLCS(original, improved);
+            const diff = buildDiff(original, improved, dp);
+            let added = 0;
+            let removed = 0;
+            let unchanged = 0;
+            for (const part of diff) {
+                if (part.type === 'insert') {
+                    added += part.value.length;
+                } else if (part.type === 'delete') {
+                    removed += part.value.length;
+                } else {
+                    unchanged += part.value.length;
+                }
+            }
+            return {
+                total: original.length,
+                added: added,
+                removed: removed,
+                unchanged: unchanged
+            };
+        },
+        async handleImageError(event) {
+            // Try to refresh the image URL
+            console.log('Image failed to load, refreshing URL...');
+            try {
+                const data = await this.api.refreshImageUrl(
+                    this.datasetId,
+                    this.datasetConfig,
+                    this.datasetSplit,
+                    this.currentIndex
+                );
+                // Update the image source (detect the image column once)
+                const imageColumn = data.row ? this.api.detectColumns(null, data.row).image : null;
+                const refreshedSrc = imageColumn ? data.row[imageColumn]?.src : null;
+                if (refreshedSrc) {
+                    event.target.src = refreshedSrc;
+                }
+            } catch (error) {
+                console.error('Failed to refresh image URL:', error);
+                // Set a placeholder image
+                event.target.src = 'data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iNDAwIiBoZWlnaHQ9IjMwMCIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj48cmVjdCB3aWR0aD0iNDAwIiBoZWlnaHQ9IjMwMCIgZmlsbD0iI2VlZSIvPjx0ZXh0IHRleHQtYW5jaG9yPSJtaWRkbGUiIHg9IjIwMCIgeT0iMTUwIiBzdHlsZT0iZmlsbDojOTk5O2ZvbnQtZmFtaWx5OkFyaWFsLHNhbnMtc2VyaWY7Zm9udC1zaXplOjIwcHg7Zm9udC13ZWlnaHQ6Ym9sZCI+SW1hZ2UgVW5hdmFpbGFibGU8L3RleHQ+PC9zdmc+';
+            }
+        },
+        exportComparison() {
+            const original = this.getOriginalText();
+            const improved = this.getImprovedText();
+            const metadata = {
+                dataset: this.datasetId,
+                page: this.currentIndex + 1,
+                totalPages: this.totalSamples,
+                exportDate: new Date().toISOString(),
+                similarity: `${this.similarity}%`,
+                statistics: {
+                    characters: this.charStats,
+                    words: this.wordStats
+                }
+            };
+            // Create export content
+            let content = `OCR Text Comparison Export\n`;
+            content += `==========================\n\n`;
+            content += `Dataset: ${metadata.dataset}\n`;
+            content += `Page: ${metadata.page} of ${metadata.totalPages}\n`;
+            content += `Export Date: ${new Date().toLocaleString()}\n`;
+            content += `Similarity: ${metadata.similarity}\n`;
+            content += `Characters: ${metadata.statistics.characters.total} total, `;
+            content += `${metadata.statistics.characters.added} added, `;
+            content += `${metadata.statistics.characters.removed} removed\n`;
+            content += `Words: ${metadata.statistics.words.original} → ${metadata.statistics.words.improved}\n`;
+            content += `\n${'='.repeat(50)}\n\n`;
+            content += `ORIGINAL OCR:\n`;
+            content += `${'='.repeat(50)}\n`;
+            content += original;
+            content += `\n\n${'='.repeat(50)}\n\n`;
+            content += `IMPROVED OCR:\n`;
+            content += `${'='.repeat(50)}\n`;
+            content += improved;
+            // Download file
+            const blob = new Blob([content], { type: 'text/plain' });
+            const url = URL.createObjectURL(blob);
+            const a = document.createElement('a');
+            a.href = url;
+            a.download = `ocr-comparison-${this.datasetId.replace(/\//g, '-')}-page-${this.currentIndex + 1}.txt`;
+            document.body.appendChild(a);
+            a.click();
+            document.body.removeChild(a);
+            URL.revokeObjectURL(url);
+        },
+        // Flow view methods
+        async toggleFlowView() {
+            this.showFlowView = !this.showFlowView;
+            if (this.showFlowView) {
+                // Reset to center around current page when opening
+                this.flowStartIndex = Math.max(0, this.currentIndex - Math.floor(this.flowVisibleCount / 2));
+                await this.loadFlowItems();
+            }
+        },
+        async loadFlowItems() {
+            // Load thumbnails from flowStartIndex
+            const startIdx = this.flowStartIndex;
+            this.flowItems = [];
+            // Load visible items
+            for (let i = 0; i < this.flowVisibleCount && (startIdx + i) < this.totalSamples; i++) {
+                const idx = startIdx + i;
+                try {
+                    const data = await this.api.getRow(
+                        this.datasetId,
+                        this.datasetConfig,
+                        this.datasetSplit,
+                        idx
+                    );
+                    const columns = this.api.detectColumns(null, data.row);
+                    const imageData = columns.image ? data.row[columns.image] : null;
+                    this.flowItems.push({
+                        index: idx,
+                        imageSrc: imageData?.src || '',
+                        row: data.row
+                    });
+                } catch (error) {
+                    console.error(`Failed to load flow item ${idx}:`, error);
+                }
+            }
+        },
+        scrollFlowLeft() {
+            if (this.flowStartIndex > 0) {
+                this.flowStartIndex = Math.max(0, this.flowStartIndex - this.flowVisibleCount);
+                this.loadFlowItems();
+            }
+        },
+        scrollFlowRight() {
+            if (this.flowStartIndex < this.totalSamples - this.flowVisibleCount) {
+                this.flowStartIndex = Math.min(
+                    this.totalSamples - this.flowVisibleCount,
+                    this.flowStartIndex + this.flowVisibleCount
+                );
+                this.loadFlowItems();
+            }
+        },
+        async jumpToFlowPage(index) {
+            this.showFlowView = false;
+            await this.loadSample(index);
+        },
+        async handleFlowImageError(event, index) {
+            // Try to refresh the image URL for flow item
+            try {
+                const data = await this.api.refreshImageUrl(
+                    this.datasetId,
+                    this.datasetConfig,
+                    this.datasetSplit,
+                    index
+                );
+                if (data.row) {
+                    const columns = this.api.detectColumns(null, data.row);
+                    const imageData = columns.image ? data.row[columns.image] : null;
+                    if (imageData?.src) {
+                        event.target.src = imageData.src;
+                        // Update the flow item
+                        const flowItem = this.flowItems.find(item => item.index === index);
+                        if (flowItem) {
+                            flowItem.imageSrc = imageData.src;
+                        }
+                    }
+                }
+            } catch (error) {
+                console.error('Failed to refresh flow image URL:', error);
+            }
+        },
+        // Dock methods
+        async showDockPreview() {
+            // Clear any hide timeout
+            if (this.dockHideTimeout) {
+                clearTimeout(this.dockHideTimeout);
+                this.dockHideTimeout = null;
+            }
+            this.showDock = true;
+            // Center dock around current page
+            this.dockStartIndex = Math.max(0,
+                Math.min(
+                    this.currentIndex - Math.floor(this.dockVisibleCount / 2),
+                    this.totalSamples - this.dockVisibleCount
+                )
+            );
+            // Always reload dock items to show current position
+            await this.loadDockItems();
+        },
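+        hideDockPreview() {
+            // Cancel any pending hide so repeated calls do not stack timers
+            if (this.dockHideTimeout) {
+                clearTimeout(this.dockHideTimeout);
+            }
+            // Add a small delay to prevent flickering
+            this.dockHideTimeout = setTimeout(() => {
+                this.showDock = false;
+                this.dockHideTimeout = null;
+            }, 300);
+        },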
+        async loadDockItems() {
+            // Load thumbnails based on dock start index
+            const endIdx = Math.min(this.totalSamples, this.dockStartIndex + this.dockVisibleCount);
+            this.dockItems = [];
+            for (let i = this.dockStartIndex; i < endIdx; i++) {
+                try {
+                    const data = await this.api.getRow(
+                        this.datasetId,
+                        this.datasetConfig,
+                        this.datasetSplit,
+                        i
+                    );
+                    const columns = this.api.detectColumns(null, data.row);
+                    const imageData = columns.image ? data.row[columns.image] : null;
+                    this.dockItems.push({
+                        index: i,
+                        imageSrc: imageData?.src || '',
+                        row: data.row
+                    });
+                } catch (error) {
+                    console.error(`Failed to load dock item ${i}:`, error);
+                }
+            }
+        },
+        async scrollDockLeft() {
+            if (this.dockStartIndex > 0) {
+                this.dockStartIndex = Math.max(0, this.dockStartIndex - Math.floor(this.dockVisibleCount / 2));
+                await this.loadDockItems();
+            }
+        },
+        async scrollDockRight() {
+            if (this.dockStartIndex < this.totalSamples - this.dockVisibleCount) {
+                this.dockStartIndex = Math.min(
+                    this.totalSamples - this.dockVisibleCount,
+                    this.dockStartIndex + Math.floor(this.dockVisibleCount / 2)
+                );
+                await this.loadDockItems();
+            }
+        },
+        async jumpToDockPage(index) {
+            this.showDock = false;
+            await this.loadSample(index);
+        },
+        // Watch for diff mode changes
+        initWatchers() {
+            this.$watch('diffMode', () => this.updateDiff());
+            this.$watch('currentSample', () => this.updateDiff());
+        }
+    }));
+});
+// Initialize watchers after Alpine loads
+document.addEventListener('alpine:initialized', () => {
+    Alpine.store('ocrExplorer')?.initWatchers?.();
+});
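
Review note: `getCharacterDiffStats()` relies on `computeLCS` and `buildDiff` from js/diff-utils.js, which are not visible in this part of the diff. The sketch below shows one way such an LCS-based character diff could produce the `added` / `removed` / `unchanged` counts. The names `lcsTable`, `diffParts`, and `diffStats` are hypothetical stand-ins, and the `'equal'` part type is an assumption (app.js only distinguishes `'insert'`, `'delete'`, and everything else).

```javascript
// Hypothetical stand-ins for computeLCS/buildDiff; not the committed code.
function lcsTable(a, b) {
    // dp[i][j] holds the LCS length of a.slice(0, i) and b.slice(0, j).
    const dp = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
    for (let i = 1; i <= a.length; i++) {
        for (let j = 1; j <= b.length; j++) {
            dp[i][j] = a[i - 1] === b[j - 1]
                ? dp[i - 1][j - 1] + 1
                : Math.max(dp[i - 1][j], dp[i][j - 1]);
        }
    }
    return dp;
}

function diffParts(a, b) {
    const dp = lcsTable(a, b);
    const parts = [];
    let i = a.length;
    let j = b.length;
    // Walk back from the bottom-right corner, emitting one character per step.
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 && a[i - 1] === b[j - 1]) {
            parts.push({ type: 'equal', value: a[i - 1] });
            i--;
            j--;
        } else if (j > 0 && (i === 0 || dp[i][j - 1] >= dp[i - 1][j])) {
            parts.push({ type: 'insert', value: b[j - 1] });
            j--;
        } else {
            parts.push({ type: 'delete', value: a[i - 1] });
            i--;
        }
    }
    return parts.reverse();
}

function diffStats(a, b) {
    const stats = { total: a.length, added: 0, removed: 0, unchanged: 0 };
    for (const part of diffParts(a, b)) {
        if (part.type === 'insert') stats.added += part.value.length;
        else if (part.type === 'delete') stats.removed += part.value.length;
        else stats.unchanged += part.value.length;
    }
    return stats;
}

const stats = diffStats('Tbe cat', 'The cat');
console.log(stats);
```

The table needs memory proportional to the product of the two text lengths, so a full page of OCR text compared character by character can get expensive. Word-level or line-level modes shrink the table considerably.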

js/dataset-api.js ADDED

	@@ -0,0 +1,273 @@

+/**
+ * HuggingFace Dataset Viewer API wrapper
+ * Handles fetching data from the datasets-server API with caching and error handling
+ */
+class DatasetAPI {
+    constructor() {
+        this.baseURL = 'https://datasets-server.huggingface.co';
+        this.cache = new Map();
+        this.cacheExpiry = 45 * 60 * 1000; // 45 minutes (conservative for signed URLs)
+        this.rowsPerFetch = 100; // API maximum
+    }
+    /**
+     * Check if a dataset is valid and has viewer enabled
+     */
+    async validateDataset(datasetId) {
+        try {
+            const response = await fetch(`${this.baseURL}/is-valid?dataset=${encodeURIComponent(datasetId)}`);
+            if (!response.ok) {
+                throw new Error(`Failed to validate dataset: ${response.statusText}`);
+            }
+            const data = await response.json();
+            if (!data.viewer) {
+                throw new Error('Dataset viewer is not available for this dataset');
+            }
+            return true;
+        } catch (error) {
+            throw new Error(`Dataset validation failed: ${error.message}`);
+        }
+    }
+    /**
+     * Get dataset info including splits and configs
+     */
+    async getDatasetInfo(datasetId) {
+        const cacheKey = `info_${datasetId}`;
+        const cached = this.getFromCache(cacheKey);
+        if (cached) return cached;
+        try {
+            const response = await fetch(`${this.baseURL}/splits?dataset=${encodeURIComponent(datasetId)}`);
+            if (!response.ok) {
+                throw new Error(`Failed to get dataset info: ${response.statusText}`);
+            }
+            const data = await response.json();
+            // Extract the default config, then pick the default split from within that config
+            const defaultConfig = data.splits[0]?.config || 'default';
+            const defaultSplit = data.splits.find(s => s.config === defaultConfig && s.split === 'train')?.split || data.splits[0]?.split || 'train';
+            const info = {
+                configs: [...new Set(data.splits.map(s => s.config))],
+                splits: [...new Set(data.splits.map(s => s.split))],
+                defaultConfig,
+                defaultSplit,
+                raw: data
+            };
+            this.setCache(cacheKey, info);
+            return info;
+        } catch (error) {
+            throw new Error(`Failed to get dataset info: ${error.message}`);
+        }
+    }
+    /**
+     * Get the total number of rows in a dataset
+     */
+    async getTotalRows(datasetId, config, split) {
+        const cacheKey = `size_${datasetId}_${config}_${split}`;
+        const cached = this.getFromCache(cacheKey);
+        if (cached !== null) return cached; // a cached size of 0 is still a valid hit
+        try {
+            // First try to get from the size endpoint
+            const sizeResponse = await fetch(
+                `${this.baseURL}/size?dataset=${encodeURIComponent(datasetId)}&config=${encodeURIComponent(config)}&split=${encodeURIComponent(split)}`
+            );
+            if (sizeResponse.ok) {
+                const sizeData = await sizeResponse.json();
+                // Prefer the requested split's count; size.config covers every split in the config
+                const size = sizeData.size?.splits?.find(s => s.split === split)?.num_rows ??
+                           sizeData.size?.config?.num_rows ??
+                           0;
+                this.setCache(cacheKey, size);
+                return size;
+            }
+            // Fallback: get first rows and check num_rows_total
+            const rowsResponse = await fetch(
+                `${this.baseURL}/first-rows?dataset=${encodeURIComponent(datasetId)}&config=${encodeURIComponent(config)}&split=${encodeURIComponent(split)}`
+            );
+            if (!rowsResponse.ok) {
+                throw new Error('Unable to determine dataset size');
+            }
+            const rowsData = await rowsResponse.json();
+            const size = rowsData.num_rows_total || rowsData.rows?.length || 0;
+            this.setCache(cacheKey, size);
+            return size;
+        } catch (error) {
+            console.warn('Failed to get total rows:', error);
+            return null;
+        }
+    }
+    /**
+     * Fetch rows from the dataset
+     */
+    async fetchRows(datasetId, config, split, offset, length = this.rowsPerFetch) {
+        const cacheKey = `rows_${datasetId}_${config}_${split}_${offset}_${length}`;
+        const cached = this.getFromCache(cacheKey);
+        if (cached) return cached;
+        try {
+            const response = await fetch(
+                `${this.baseURL}/rows?dataset=${encodeURIComponent(datasetId)}&config=${encodeURIComponent(config)}&split=${encodeURIComponent(split)}&offset=${offset}&length=${length}`
+            );
+            if (!response.ok) {
+                if (response.status === 401 || response.status === 403) {
+                    throw new Error('Access denied. This dataset may be private or gated.');
+                }
+                throw new Error(`Failed to fetch rows: ${response.statusText}`);
+            }
+            const data = await response.json();
+            // Extract column information
+            const columns = this.detectColumns(data.features, data.rows[0]?.row);
+            const result = {
+                rows: data.rows,
+                features: data.features,
+                columns,
+                numRowsTotal: data.num_rows_total,
+                partial: data.partial || false
+            };
+            this.setCache(cacheKey, result);
+            return result;
+        } catch (error) {
+            throw new Error(`Failed to fetch rows: ${error.message}`);
+        }
+    }
+    /**
+     * Get a single row by index with smart batching
+     */
+    async getRow(datasetId, config, split, index) {
+        // Calculate which batch this index falls into
+        const batchStart = Math.floor(index / this.rowsPerFetch) * this.rowsPerFetch;
+        const batchData = await this.fetchRows(datasetId, config, split, batchStart, this.rowsPerFetch);
+        const localIndex = index - batchStart;
+        if (localIndex >= 0 && localIndex < batchData.rows.length) {
+            return {
+                row: batchData.rows[localIndex].row,
+                columns: batchData.columns,
+                numRowsTotal: batchData.numRowsTotal
+            };
+        }
+        throw new Error(`Row ${index} not found`);
+    }
+    /**
+     * Detect column names for image and text data
+     */
+    detectColumns(features, sampleRow) {
+        let imageColumn = null;
+        let originalTextColumn = null;
+        let improvedTextColumn = null;
+        // Try to detect from features first
+        for (const feature of features || []) {
+            const name = feature.name;
+            const type = feature.type;
+            // Detect image column
+            if (type._type === 'Image' || type.dtype === 'image' || type.feature?._type === 'Image') {
+                imageColumn = name;
+            }
+            // Detect text columns based on common patterns
+            if (!originalTextColumn && ['text', 'ocr', 'original_text', 'original', 'ground_truth'].includes(name)) {
+                originalTextColumn = name;
+            }
+            if (!improvedTextColumn && ['markdown', 'new_ocr', 'corrected_text', 'improved', 'vlm_ocr', 'corrected'].includes(name)) {
+                improvedTextColumn = name;
+            }
+        }
+        // Fallback: detect from sample row
+        if (sampleRow) {
+            const keys = Object.keys(sampleRow);
+            if (!imageColumn) {
+                for (const key of keys) {
+                    if (sampleRow[key]?.src && sampleRow[key]?.height !== undefined) {
+                        imageColumn = key;
+                        break;
+                    }
+                }
+            }
+            // Additional text column detection from row data
+            if (!originalTextColumn) {
+                const candidates = ['text', 'ocr', 'original_text', 'original', 'ground_truth'];
+                originalTextColumn = keys.find(k => candidates.includes(k)) || null;
+            }
+            if (!improvedTextColumn) {
+                const candidates = ['markdown', 'new_ocr', 'corrected_text', 'improved', 'vlm_ocr', 'corrected'];
+                improvedTextColumn = keys.find(k => candidates.includes(k)) || null;
+            }
+        }
+        return {
+            image: imageColumn,
+            originalText: originalTextColumn,
+            improvedText: improvedTextColumn
+        };
+    }
+    /**
+     * Refresh expired image URL by re-fetching the row
+     */
+    async refreshImageUrl(datasetId, config, split, index) {
+        // Clear cache for this specific row batch
+        const batchStart = Math.floor(index / this.rowsPerFetch) * this.rowsPerFetch;
+        const cacheKey = `rows_${datasetId}_${config}_${split}_${batchStart}_${this.rowsPerFetch}`;
+        this.cache.delete(cacheKey);
+        // Re-fetch the row
+        return await this.getRow(datasetId, config, split, index);
+    }
+    /**
+     * Cache management utilities
+     */
+    getFromCache(key) {
+        const cached = this.cache.get(key);
+        if (!cached) return null;
+        if (Date.now() - cached.timestamp > this.cacheExpiry) {
+            this.cache.delete(key);
+            return null;
+        }
+        return cached.data;
+    }
+    setCache(key, data) {
+        this.cache.set(key, {
+            data,
+            timestamp: Date.now()
+        });
+    }
+    clearCache() {
+        this.cache.clear();
+    }
+}
+// Export for use in other scripts
+window.DatasetAPI = DatasetAPI;
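The batching and expiry logic in `DatasetAPI` can be exercised without network access. The following is a minimal sketch, assuming hypothetical stand-ins `batchFor` and `TTLCache` that mirror `getRow` and `getFromCache`/`setCache`; the injectable `now` clock is an assumption added so expiry can be tested deterministically:

```javascript
// Hypothetical stand-ins mirroring DatasetAPI's logic; names are illustrative only.

// Map a global row index to its batch offset and its position within that batch.
function batchFor(index, rowsPerFetch = 100) {
    const batchStart = Math.floor(index / rowsPerFetch) * rowsPerFetch;
    return { batchStart, localIndex: index - batchStart };
}

// Map-backed cache whose entries expire after ttlMs milliseconds.
class TTLCache {
    constructor(ttlMs, now = () => Date.now()) {
        this.ttlMs = ttlMs;
        this.now = now; // injectable clock (assumption, for deterministic tests)
        this.entries = new Map();
    }
    set(key, data) {
        this.entries.set(key, { data, timestamp: this.now() });
    }
    get(key) {
        const entry = this.entries.get(key);
        if (!entry) return null;
        if (this.now() - entry.timestamp > this.ttlMs) {
            this.entries.delete(key); // evict stale entry, e.g. rows holding expired signed URLs
            return null;
        }
        return entry.data;
    }
}

let fakeTime = 0;
const cache = new TTLCache(45 * 60 * 1000, () => fakeTime);
cache.set('rows_0_100', ['row data']);
fakeTime = 44 * 60 * 1000;
console.log(cache.get('rows_0_100')); // still cached
fakeTime = 46 * 60 * 1000;
console.log(cache.get('rows_0_100')); // null
console.log(batchFor(257));           // { batchStart: 200, localIndex: 57 }
```

Because every row in a batch shares one cache key, navigating within the same 100-row window costs a single request, and `refreshImageUrl` only has to drop that one key to obtain fresh signed URLs.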

js/diff-utils.js ADDED

	@@ -0,0 +1,219 @@

+/**
+ * Text comparison utilities for OCR Text Explorer
+ * Provides character, word, and line-level diff visualization
+ */
+/**
+ * Create character-level diff with inline highlighting
+ */
+function createCharacterDiff(original, improved) {
+    if (!original || !improved) {
+        return '<p class="text-gray-500">No text to compare</p>';
+    }
+    const dp = computeLCS(original, improved);
+    const diff = buildDiff(original, improved, dp);
+    let html = '<div class="font-mono text-sm whitespace-pre-wrap text-gray-900 dark:text-gray-100">';
+    for (const part of diff) {
+        if (part.type === 'equal') {
+            html += escapeHtml(part.value);
+        } else if (part.type === 'delete') {
+            html += `<span class="bg-red-200 dark:bg-red-950 text-red-800 dark:text-red-300 line-through">${escapeHtml(part.value)}</span>`;
+        } else if (part.type === 'insert') {
+            html += `<span class="bg-green-200 dark:bg-green-950 text-green-800 dark:text-green-300">${escapeHtml(part.value)}</span>`;
+        }
+    }
+    html += '</div>';
+    return html;
+}
+/**
+ * Create word-level diff
+ */
+function createWordDiff(original, improved) {
+    if (!original || !improved) {
+        return '<p class="text-gray-500">No text to compare</p>';
+    }
+    // Split into words while preserving whitespace
+    const originalWords = splitIntoWords(original);
+    const improvedWords = splitIntoWords(improved);
+    const dp = computeLCS(originalWords, improvedWords);
+    const diff = buildDiff(originalWords, improvedWords, dp);
+    let html = '<div class="font-mono text-sm whitespace-pre-wrap text-gray-900 dark:text-gray-100">';
+    for (const part of diff) {
+        if (part.type === 'equal') {
+            html += escapeHtml(part.value.join(''));
+        } else if (part.type === 'delete') {
+            html += `<span class="bg-red-200 dark:bg-red-950 text-red-800 dark:text-red-300 line-through">${escapeHtml(part.value.join(''))}</span>`;
+        } else if (part.type === 'insert') {
+            html += `<span class="bg-green-200 dark:bg-green-950 text-green-800 dark:text-green-300">${escapeHtml(part.value.join(''))}</span>`;
+        }
+    }
+    html += '</div>';
+    return html;
+}
+/**
+ * Create line-level diff
+ */
+function createLineDiff(original, improved) {
+    if (!original || !improved) {
+        return '<p class="text-gray-500">No text to compare</p>';
+    }
+    const originalLines = original.split('\n');
+    const improvedLines = improved.split('\n');
+    const dp = computeLCS(originalLines, improvedLines);
+    const diff = buildDiff(originalLines, improvedLines, dp);
+    let html = '<div class="font-mono text-sm text-gray-900 dark:text-gray-100">';
+    for (const part of diff) {
+        if (part.type === 'equal') {
+            for (const line of part.value) {
+                html += `<div class="py-1">${escapeHtml(line)}</div>`;
+            }
+        } else if (part.type === 'delete') {
+            for (const line of part.value) {
+                html += `<div class="py-1 bg-red-200 dark:bg-red-950 text-red-800 dark:text-red-300 line-through">${escapeHtml(line)}</div>`;
+            }
+        } else if (part.type === 'insert') {
+            for (const line of part.value) {
+                html += `<div class="py-1 bg-green-200 dark:bg-green-950 text-green-800 dark:text-green-300">${escapeHtml(line)}</div>`;
+            }
+        }
+    }
+    html += '</div>';
+    return html;
+}
+/**
+ * Compute Longest Common Subsequence using dynamic programming
+ */
+function computeLCS(a, b) {
+    const m = a.length;
+    const n = b.length;
+    const dp = Array(m + 1).fill(null).map(() => Array(n + 1).fill(0));
+    for (let i = 1; i <= m; i++) {
+        for (let j = 1; j <= n; j++) {
+            if (a[i - 1] === b[j - 1]) {
+                dp[i][j] = dp[i - 1][j - 1] + 1;
+            } else {
+                dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
+            }
+        }
+    }
+    return dp;
+}
+/**
+ * Build diff from LCS table
+ */
+function buildDiff(a, b, dp) {
+    const diff = [];
+    let i = a.length;
+    let j = b.length;
+    while (i > 0 || j > 0) {
+        if (i > 0 && j > 0 && a[i - 1] === b[j - 1]) {
+            // Elements are equal (characters, words, or lines depending on the caller)
+            if (diff.length > 0 && diff[diff.length - 1].type === 'equal') {
+                diff[diff.length - 1].value.unshift(a[i - 1]);
+            } else {
+                diff.push({ type: 'equal', value: [a[i - 1]] });
+            }
+            i--;
+            j--;
+        } else if (j > 0 && (i === 0 || dp[i][j - 1] >= dp[i - 1][j])) {
+            // Element in b but not in a (insertion)
+            if (diff.length > 0 && diff[diff.length - 1].type === 'insert') {
+                diff[diff.length - 1].value.unshift(b[j - 1]);
+            } else {
+                diff.push({ type: 'insert', value: [b[j - 1]] });
+            }
+            j--;
+        } else {
+            // Element in a but not in b (deletion)
+            if (diff.length > 0 && diff[diff.length - 1].type === 'delete') {
+                diff[diff.length - 1].value.unshift(a[i - 1]);
+            } else {
+                diff.push({ type: 'delete', value: [a[i - 1]] });
+            }
+            i--;
+        }
+    }
+    diff.reverse();
+    // Convert arrays to strings for character diff
+    if (typeof a === 'string') {
+        diff.forEach(part => {
+            part.value = part.value.join('');
+        });
+    }
+    return diff;
+}
+/**
+ * Split text into words while preserving whitespace
+ */
+function splitIntoWords(text) {
+    const words = [];
+    let current = '';
+    let inWord = false;
+    for (const char of text) {
+        if (/\s/.test(char)) {
+            if (inWord && current) {
+                words.push(current);
+                current = '';
+                inWord = false;
+            }
+            words.push(char);
+        } else {
+            current += char;
+            inWord = true;
+        }
+    }
+    if (current) {
+        words.push(current);
+    }
+    return words;
+}
+/**
+ * Escape HTML special characters
+ */
+function escapeHtml(text) {
+    const div = document.createElement('div');
+    div.textContent = text;
+    return div.innerHTML;
+}
+/**
+ * Calculate similarity percentage between two texts
+ */
+function calculateSimilarity(original, improved) {
+    if (!original || !improved) return 0;
+    const dp = computeLCS(original, improved);
+    const lcsLength = dp[original.length][improved.length];
+    const maxLength = Math.max(original.length, improved.length);
+    return Math.round((lcsLength / maxLength) * 100);
+}
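The LCS table and backtrack used by `computeLCS`, `buildDiff`, and `calculateSimilarity` can be traced on a small input. The following is a minimal sketch, assuming hypothetical names `lcsTable`, `diffOps`, and `similarity`. It is an illustrative re-implementation that emits one operation per element instead of merged runs, and it skips HTML rendering so it runs outside a browser:

```javascript
// Illustrative re-implementation (not the file's code): LCS table plus backtrack.
function lcsTable(a, b) {
    const dp = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
    for (let i = 1; i <= a.length; i++) {
        for (let j = 1; j <= b.length; j++) {
            dp[i][j] = a[i - 1] === b[j - 1]
                ? dp[i - 1][j - 1] + 1
                : Math.max(dp[i - 1][j], dp[i][j - 1]);
        }
    }
    return dp;
}

// Walk the table from the bottom-right corner, collecting single-element ops.
function diffOps(a, b) {
    const dp = lcsTable(a, b);
    const ops = [];
    let i = a.length;
    let j = b.length;
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 && a[i - 1] === b[j - 1]) {
            ops.push(['equal', a[i - 1]]);
            i--;
            j--;
        } else if (j > 0 && (i === 0 || dp[i][j - 1] >= dp[i - 1][j])) {
            ops.push(['insert', b[j - 1]]);
            j--;
        } else {
            ops.push(['delete', a[i - 1]]);
            i--;
        }
    }
    return ops.reverse(); // backtracking yields ops last-to-first
}

// Same ratio as calculateSimilarity: LCS length over the longer input.
function similarity(a, b) {
    if (!a || !b) return 0;
    const lcsLength = lcsTable(a, b)[a.length][b.length];
    return Math.round((lcsLength / Math.max(a.length, b.length)) * 100);
}

console.log(diffOps('tbe', 'the'));
// [ [ 'equal', 't' ], [ 'delete', 'b' ], [ 'insert', 'h' ], [ 'equal', 'e' ] ]
console.log(similarity('tbe cat', 'the cat')); // 86
```

The table takes O(m × n) time and memory, so a character-level diff of two long pages allocates a large array. Word or line mode shrinks both dimensions considerably.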