Commit c49cb47 by davanstrien (HF Staff) · 1 Parent(s): b9aef14

Configure OCR Time Capsule with default dataset and branding

Files changed (5)
  1. CLAUDE.md +186 -0
  2. css/styles.css +197 -0
  3. js/app.js +550 -0
  4. js/dataset-api.js +273 -0
  5. js/diff-utils.js +219 -0
CLAUDE.md ADDED
@@ -0,0 +1,186 @@
1
+ # CLAUDE.md
2
+
3
+ This file provides guidance to Claude Code (claude.ai/code) when working with the OCR Text Explorer.
4
+
5
+ ## Project Overview
6
+
7
+ OCR Text Explorer is a modern, standalone web application for browsing and comparing OCR text improvements in HuggingFace datasets. Built as a lightweight alternative to the Gradio-based OCR Time Machine, it focuses specifically on exploring pre-OCR'd datasets with an enhanced user experience.
8
+
9
+ ## Architecture
10
+
11
+ ### Technology Stack
12
+ - **Frontend Framework**: Alpine.js (lightweight reactivity, ~15KB)
13
+ - **Styling**: Tailwind CSS (utility-first, responsive design)
14
+ - **Interactions**: HTMX (server-side rendering capabilities)
15
+ - **API**: HuggingFace Dataset Viewer API (no backend required)
16
+ - **Language**: Vanilla JavaScript (no build process needed)
17
+
18
+ ### Core Components
19
+
20
+ **index.html** - Main application shell
21
+ - Split-pane layout (1/3 image, 2/3 text comparison)
22
+ - Three view modes: Side-by-side, Inline diff, Improved only
23
+ - Dark mode support with proper contrast
24
+ - Responsive design for mobile devices
25
+
26
+ **js/dataset-api.js** - HuggingFace API wrapper
27
+ - Smart caching with 45-minute expiration for signed URLs
28
+ - Batch loading (100 rows at a time)
29
+ - Automatic column detection for different dataset schemas
30
+ - Image URL refresh on expiration
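The caching behaviour listed for js/dataset-api.js can be sketched as a small time-to-live cache. This is a minimal illustration and not the project's implementation: the `TtlCache` class and its injectable `now` clock are hypothetical names, and only the 45-minute expiry comes from the notes above.

```javascript
// Hypothetical TTL cache; the class name and `now` parameter are illustrative only.
class TtlCache {
  constructor(ttlMs = 45 * 60 * 1000, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock so expiry can be exercised in tests
    this.entries = new Map();
  }

  set(key, value) {
    this.entries.set(key, { value, storedAt: this.now() });
  }

  // Returns the cached value, or undefined when the key is missing or expired.
  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() - entry.storedAt >= this.ttlMs) {
      this.entries.delete(key); // drop stale entries so signed URLs get refetched
      return undefined;
    }
    return entry.value;
  }
}
```

Returning `undefined` for a miss lets callers tell a cached `0` (for example an empty dataset's row count) apart from an absent entry by comparing against `undefined` instead of testing truthiness.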
31
+
32
+ **js/app.js** - Alpine.js application logic
33
+ - Keyboard navigation (J/K, arrows)
34
+ - URL state management for shareable links
35
+ - Diff mode switching (character/word/line)
36
+ - Dark mode persistence in localStorage
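The URL state management mentioned above can be illustrated with a pair of pure helpers. This is a sketch under assumptions: `readStateFromUrl` and `writeStateToUrl` are hypothetical names, and the `dataset` and `index` query parameters plus the default dataset are taken from js/app.js in this commit.

```javascript
// Hypothetical helper: parse shareable-link state from a URL string.
function readStateFromUrl(href, defaults = { dataset: 'davanstrien/exams-ocr', index: 0 }) {
  const params = new URL(href).searchParams;
  const dataset = params.get('dataset') || defaults.dataset;
  const parsed = Number.parseInt(params.get('index') ?? '', 10);
  // Fall back to the default when the index is missing, non-numeric, or negative.
  const index = Number.isInteger(parsed) && parsed >= 0 ? parsed : defaults.index;
  return { dataset, index };
}

// Inverse direction: build the URL string for a given state.
// It does not touch window.history; the caller decides how to apply it.
function writeStateToUrl(href, { dataset, index }) {
  const url = new URL(href);
  url.searchParams.set('dataset', dataset);
  url.searchParams.set('index', String(index));
  return url.toString();
}
```

Keeping both helpers free of `window` access makes them easy to test outside a browser; in the page, the result of `writeStateToUrl(window.location.href, state)` could be passed to `history.replaceState`.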
37
+
38
+ **js/diff-utils.js** - Text comparison algorithms
39
+ - Character-level diff with inline highlighting
40
+ - Word-level diff preserving whitespace
41
+ - Line-level diff for larger changes
42
+ - LCS (Longest Common Subsequence) implementation
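For readers unfamiliar with LCS-based diffing, the idea behind js/diff-utils.js can be sketched as follows. This is an illustrative version, not the file's actual code: `lcsTable` and `diffFromTable` are hypothetical names, and the `'equal'` part type is an assumption (js/app.js only distinguishes `'insert'` and `'delete'` from everything else).

```javascript
// Build the LCS length table: dp[i][j] is the LCS length of a[0..i) and b[0..j).
// Strings are compared by UTF-16 code unit, which is enough for a sketch.
function lcsTable(a, b) {
  const dp = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = a[i - 1] === b[j - 1]
        ? dp[i - 1][j - 1] + 1
        : Math.max(dp[i - 1][j], dp[i][j - 1]);
    }
  }
  return dp;
}

// Walk the table backwards and return { type, value } parts in document order.
function diffFromTable(a, b, dp) {
  const ops = [];
  let i = a.length;
  let j = b.length;
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 && a[i - 1] === b[j - 1]) {
      ops.push({ type: 'equal', value: a[i - 1] });
      i--;
      j--;
    } else if (j > 0 && (i === 0 || dp[i][j - 1] >= dp[i - 1][j])) {
      ops.push({ type: 'insert', value: b[j - 1] });
      j--;
    } else {
      ops.push({ type: 'delete', value: a[i - 1] });
      i--;
    }
  }
  ops.reverse();
  // Merge adjacent single-character parts of the same type into runs.
  const parts = [];
  for (const op of ops) {
    const last = parts[parts.length - 1];
    if (last && last.type === op.type) last.value += op.value;
    else parts.push({ ...op });
  }
  return parts;
}
```

The table needs memory proportional to the product of the two input lengths, so for very long pages a word-level or line-level diff over the same routine (tokens instead of characters) keeps the table much smaller.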
43
+
44
+ **css/styles.css** - Custom styling
45
+ - Dark mode enhancements
46
+ - Diff highlighting with accessibility in mind
47
+ - Smooth transitions and animations
48
+ - Print-friendly styles
49
+
50
+ ## Key Design Decisions
51
+
52
+ ### Why Separate from OCR Time Machine?
53
+
54
+ 1. **Focused Purpose**: OCR Time Machine is for live OCR processing with VLMs (requires GPU), while this explorer is for browsing pre-processed results
55
+ 2. **Performance**: No Python/Gradio overhead - instant loading and navigation
56
+ 3. **User Experience**: Custom UI optimized for text comparison workflows
57
+ 4. **Deployment**: Static files can be hosted anywhere (GitHub Pages, CDN, etc.)
58
+
59
+ ### API vs Backend Trade-offs
60
+
61
+ **Chose HF Dataset Viewer API because:**
62
+ - No backend infrastructure needed
63
+ - Automatic image serving with CDN
64
+ - Built-in pagination support
65
+ - Works with any public HF dataset
66
+
67
+ **Limitations accepted:**
68
+ - Image URLs expire (~1 hour)
69
+ - 100 rows max per request
70
+ - No write capabilities
71
+ - Public datasets only (no auth yet)
72
+
73
+ ### UI/UX Principles
74
+
75
+ 1. **Keyboard-first**: Professional users prefer keyboard navigation
76
+ 2. **Information density**: Show more content, less chrome
77
+ 3. **Visual diff**: Color-coded changes are easier to scan than side-by-side
78
+ 4. **Dark mode**: Essential for extended reading sessions
79
+ 5. **Responsive**: Works on tablets for field work
80
+
81
+ ## Development Approach
82
+
83
+ ### Phase 1: MVP (Completed)
84
+ - Basic dataset loading and navigation
85
+ - Side-by-side text comparison
86
+ - Keyboard shortcuts
87
+ - Dark mode
88
+
89
+ ### Phase 2: Enhancements (Completed)
90
+ - Three diff algorithms (char/word/line)
91
+ - URL state management
92
+ - Image error handling with refresh
93
+ - Responsive mobile layout
94
+
95
+ ### Phase 3: Polish (Completed)
96
+ - Fixed dark mode contrast issues
97
+ - Optimized performance with direct indexing
98
+ - Added loading states and error handling
99
+ - Comprehensive documentation
100
+
101
+ ## Common Tasks
102
+
103
+ ### Adding Column Name Patterns
104
+ ```javascript
105
+ // In dataset-api.js detectColumns() method
106
+ if (!originalTextColumn && ['your_column_name'].includes(name)) {
107
+ originalTextColumn = name;
108
+ }
109
+ ```
110
+
111
+ ### Adding Keyboard Shortcuts
112
+ ```javascript
113
+ // In app.js setupKeyboardNavigation()
114
+ case 'your_key':
115
+ // Your action
116
+ break;
117
+ ```
118
+
119
+ ### Customizing Diff Colors
120
+ ```javascript
121
+ // In diff-utils.js
122
+ // Light mode: bg-red-200, text-red-800
123
+ // Dark mode: bg-red-950, text-red-300
124
+ ```
125
+
126
+ ## Performance Optimizations
127
+
128
+ 1. **Direct Dataset Indexing**: Requests each sample by its row index (`getRow(datasetId, config, split, index)`) instead of loading the whole dataset into memory
129
+ 2. **Smart Caching**: Caches API responses for 45 minutes (conservative for signed URLs)
130
+ 3. **Batch Fetching**: Loads 100 rows at once, caches for smooth navigation
131
+ 4. **Lazy Loading**: Only fetches data when needed
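The batch fetching described in this list relies on mapping a row index to the batch that contains it. A minimal sketch, assuming fixed batches of 100 rows as stated above; `locateRow` and `ROWS_PER_FETCH` are hypothetical names introduced only for illustration:

```javascript
// Hypothetical helper: find the batch that holds a given row.
const ROWS_PER_FETCH = 100; // per-request maximum noted in this document

function locateRow(index, batchSize = ROWS_PER_FETCH) {
  if (!Number.isInteger(index) || index < 0) {
    throw new RangeError(`index must be a non-negative integer, got ${index}`);
  }
  // Offset of the first row in the batch that contains `index`.
  const offset = Math.floor(index / batchSize) * batchSize;
  // Position of the requested row inside that batch.
  return { offset, position: index - offset };
}
```

With this mapping, rows 200 to 299 all resolve to the same `offset` of 200, so one cached batch can serve a hundred consecutive navigation steps before another request is needed.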
132
+
133
+ ## Known Issues & Solutions
134
+
135
+ ### Issue: Navigation buttons were disabled
136
+ **Cause**: API response structure wasn't parsed correctly
137
+ **Fix**: Updated getTotalRows() to check `size.config.num_rows` and `size.splits[0].num_rows`
138
+
139
+ ### Issue: Dark mode text unreadable
140
+ **Cause**: Insufficient contrast in diff highlighting and code blocks
141
+ **Fix**:
142
+ - Changed diff colors to use `dark:bg-red-950` and `dark:text-red-300`
143
+ - Added explicit `text-gray-900 dark:text-gray-100` to all text containers
144
+
145
+ ### Issue: Image loading errors
146
+ **Cause**: Signed URLs expire after ~1 hour
147
+ **Fix**: Implemented handleImageError() with automatic URL refresh
148
+
149
+ ## Future Enhancements
150
+
151
+ - [ ] Search/filter within dataset
152
+ - [ ] Bookmark favorite samples
153
+ - [ ] Export selected texts
154
+ - [ ] Support for private datasets (auth)
155
+ - [ ] Metrics display (CER/WER)
156
+ - [ ] Batch operations
157
+ - [ ] PWA support for offline viewing
158
+
159
+ ## Deployment
160
+
161
+ ### Static Hosting (Recommended)
162
+ ```bash
163
+ # Any static file server works
164
+ python3 -m http.server 8000
165
+ npx serve .
166
+ ```
167
+
168
+ ### GitHub Pages
169
+ 1. Push to GitHub repository
170
+ 2. Enable Pages in settings
171
+ 3. Access at: `https://[username].github.io/[repo]/ocr-text-explorer/`
172
+
173
+ ### CDN Deployment
174
+ - Upload files to any CDN
175
+ - No server-side processing needed
176
+ - Works with Cloudflare, Netlify, Vercel, etc.
177
+
178
+ ## Testing Datasets
179
+
180
+ Known working datasets:
181
+ - `davanstrien/exams-ocr` - Default dataset with great examples
182
+ - Any dataset with image + text columns
183
+
184
+ Column patterns automatically detected:
185
+ - Original: `text`, `ocr`, `original_text`, `ground_truth`
186
+ - Improved: `markdown`, `new_ocr`, `corrected_text`, `vlm_ocr`
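The column patterns listed above can be turned into a small detection routine. This is a sketch that follows the shape of the `detectColumns()` snippet under "Adding Column Name Patterns"; `detectTextColumns` is a hypothetical name, exact case-sensitive matching is an assumption, and image column detection is left out.

```javascript
// Patterns taken from this document; the function itself is a hypothetical sketch.
const ORIGINAL_PATTERNS = ['text', 'ocr', 'original_text', 'ground_truth'];
const IMPROVED_PATTERNS = ['markdown', 'new_ocr', 'corrected_text', 'vlm_ocr'];

function detectTextColumns(row) {
  let originalText = null;
  let improvedText = null;
  // The first matching column in row order wins, as in the documented snippet.
  for (const name of Object.keys(row)) {
    if (!originalText && ORIGINAL_PATTERNS.includes(name)) originalText = name;
    if (!improvedText && IMPROVED_PATTERNS.includes(name)) improvedText = name;
  }
  return { originalText, improvedText };
}
```

A row without a recognised column yields `null` for that role, which gives the caller a clear signal to show a fallback message instead of indexing the row with `undefined`.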
css/styles.css ADDED
@@ -0,0 +1,197 @@
1
+ /**
2
+ * Custom styles for OCR Text Explorer
3
+ * Extends Tailwind CSS with specific styling needs
4
+ */
5
+
6
+ /* Custom scrollbar styling */
7
+ ::-webkit-scrollbar {
8
+ width: 8px;
9
+ height: 8px;
10
+ }
11
+
12
+ ::-webkit-scrollbar-track {
13
+ @apply bg-gray-100 dark:bg-gray-800;
14
+ }
15
+
16
+ ::-webkit-scrollbar-thumb {
17
+ @apply bg-gray-400 dark:bg-gray-600 rounded;
18
+ }
19
+
20
+ ::-webkit-scrollbar-thumb:hover {
21
+ @apply bg-gray-500 dark:bg-gray-500;
22
+ }
23
+
24
+ /* Firefox scrollbar */
25
+ * {
26
+ scrollbar-width: thin;
27
+ scrollbar-color: theme('colors.gray.400') theme('colors.gray.100');
28
+ }
29
+
30
+ .dark * {
31
+ scrollbar-color: theme('colors.gray.600') theme('colors.gray.800');
32
+ }
33
+
34
+ /* Smooth transitions for theme switching */
35
+ body {
36
+ transition: background-color 0.3s ease, color 0.3s ease;
37
+ }
38
+
39
+ /* Image panel sticky positioning adjustment */
40
+ .sticky {
41
+ position: -webkit-sticky;
42
+ position: sticky;
43
+ }
44
+
45
+ /* Diff content styling */
46
+ .diff-content {
47
+ line-height: 1.6;
48
+ word-break: break-word;
49
+ }
50
+
51
+ /* Keyboard hint styling */
52
+ kbd {
53
+ @apply inline-block px-2 py-1 text-xs font-semibold text-gray-800 bg-gray-100 border border-gray-300 rounded dark:bg-gray-700 dark:text-gray-200 dark:border-gray-600;
54
+ box-shadow: 0 1px 0 rgba(0, 0, 0, 0.1);
55
+ }
56
+
57
+ /* Loading spinner animation (in case Tailwind's animate-spin needs adjustment) */
58
+ @keyframes spin {
59
+ to {
60
+ transform: rotate(360deg);
61
+ }
62
+ }
63
+
64
+ .animate-spin {
65
+ animation: spin 1s linear infinite;
66
+ }
67
+
68
+ /* Tab hover effect */
69
+ nav button {
70
+ position: relative;
71
+ transition: color 0.2s ease;
72
+ }
73
+
74
+ nav button::after {
75
+ content: '';
76
+ position: absolute;
77
+ bottom: -2px;
78
+ left: 0;
79
+ right: 0;
80
+ height: 2px;
81
+ background-color: transparent;
82
+ transition: background-color 0.2s ease;
83
+ }
84
+
85
+ nav button:hover::after {
86
+ @apply bg-gray-300 dark:bg-gray-600;
87
+ }
88
+
89
+ /* Image loading state */
90
+ img {
91
+ @apply bg-gray-200 dark:bg-gray-700;
92
+ min-height: 200px;
93
+ }
94
+
95
+ img[src=""] {
96
+ visibility: hidden;
97
+ }
98
+
99
+ /* Print styles */
100
+ @media print {
101
+ header, footer {
102
+ display: none !important;
103
+ }
104
+
105
+ .no-print {
106
+ display: none !important;
107
+ }
108
+
109
+ main {
110
+ height: auto !important;
111
+ }
112
+
113
+ .diff-content {
114
+ page-break-inside: avoid;
115
+ }
116
+ }
117
+
118
+ /* Responsive adjustments */
119
+ @media (max-width: 768px) {
120
+ /* Stack panels vertically on mobile */
121
+ main.flex {
122
+ @apply flex-col;
123
+ }
124
+
125
+ /* Full width for panels on mobile */
126
+ main > div:first-child {
127
+ @apply w-full max-h-96;
128
+ }
129
+
130
+ /* Adjust text size */
131
+ .prose-sm {
132
+ @apply text-xs;
133
+ }
134
+
135
+ /* Hide keyboard hints on mobile */
136
+ footer .text-sm:last-child {
137
+ @apply hidden;
138
+ }
139
+ }
140
+
141
+ /* Focus styles for accessibility */
142
+ button:focus, input:focus, select:focus {
143
+ @apply outline-none ring-2 ring-blue-500 ring-offset-2 dark:ring-offset-gray-900;
144
+ }
145
+
146
+ /* Custom tooltip styles (if needed later) */
147
+ .tooltip {
148
+ @apply invisible absolute z-10 px-2 py-1 text-xs text-white bg-gray-900 rounded shadow-lg dark:bg-gray-700;
149
+ }
150
+
151
+ .tooltip-trigger:hover .tooltip {
152
+ @apply visible;
153
+ }
154
+
155
+ /* Preserve whitespace in diff views */
156
+ .whitespace-pre-wrap {
157
+ white-space: pre-wrap;
158
+ word-wrap: break-word;
159
+ }
160
+
161
+ /* Enhanced diff highlighting with better dark mode contrast */
162
+ .diff-delete {
163
+ @apply bg-red-200 dark:bg-red-950 text-red-800 dark:text-red-300;
164
+ text-decoration: line-through;
165
+ text-decoration-color: currentColor;
166
+ text-decoration-thickness: 2px;
167
+ }
168
+
169
+ .diff-insert {
170
+ @apply bg-green-200 dark:bg-green-950 text-green-800 dark:text-green-300;
171
+ position: relative;
172
+ }
173
+
174
+ /* Dark mode specific improvements */
175
+ .dark .prose {
176
+ @apply text-gray-200;
177
+ }
178
+
179
+ .dark .prose h3 {
180
+ @apply text-gray-100;
181
+ }
182
+
183
+ /* Remove this - handled inline with classes
184
+ .dark pre {
185
+ @apply bg-gray-800 text-gray-200;
186
+ } */
187
+
188
+ /* Line numbers for future enhancement */
189
+ .line-numbers {
190
+ counter-reset: line;
191
+ }
192
+
193
+ .line-numbers > div::before {
194
+ counter-increment: line;
195
+ content: counter(line);
196
+ @apply inline-block w-12 mr-4 text-right text-gray-400 dark:text-gray-600 select-none;
197
+ }
js/app.js ADDED
@@ -0,0 +1,550 @@
1
+ /**
2
+ * Main Alpine.js application for OCR Text Explorer
3
+ */
4
+
5
+ document.addEventListener('alpine:init', () => {
6
+ Alpine.data('ocrExplorer', () => ({
7
+ // Dataset state
8
+ datasetId: 'davanstrien/exams-ocr',
9
+ datasetConfig: 'default',
10
+ datasetSplit: 'train',
11
+
12
+ // Navigation state
13
+ currentIndex: 0,
14
+ totalSamples: null,
15
+ currentSample: null,
16
+ jumpToPage: '',
17
+
18
+ // UI state
19
+ loading: false,
20
+ error: null,
21
+ activeTab: 'comparison',
22
+ diffMode: 'char',
23
+ darkMode: false,
24
+ showAbout: false,
25
+ showFlowView: false,
26
+ showDock: false,
27
+
28
+ // Flow view state
29
+ flowItems: [],
30
+ flowStartIndex: 0,
31
+ flowVisibleCount: 7,
32
+ flowOffset: 0,
33
+
34
+ // Dock state
35
+ dockItems: [],
36
+ dockHideTimeout: null,
37
+ dockStartIndex: 0,
38
+ dockVisibleCount: 10,
39
+
40
+ // Computed diff HTML
41
+ diffHtml: '',
42
+
43
+ // Statistics
44
+ similarity: 0,
45
+ charStats: { total: 0, added: 0, removed: 0 },
46
+ wordStats: { original: 0, improved: 0 },
47
+
48
+ // API instance
49
+ api: null,
50
+
51
+ async init() {
52
+ // Initialize API
53
+ this.api = new DatasetAPI();
54
+
55
+ // Apply dark mode from localStorage
56
+ this.darkMode = localStorage.getItem('darkMode') === 'true';
57
+ this.$watch('darkMode', value => {
58
+ localStorage.setItem('darkMode', value);
59
+ document.documentElement.classList.toggle('dark', value);
60
+ });
61
+ document.documentElement.classList.toggle('dark', this.darkMode);
62
+
63
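+ this.initWatchers(); // Register watchers here: Alpine.store('ocrExplorer') at the end of this file is undefined for an Alpine.data component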
64
+ this.setupKeyboardNavigation();
65
+
66
+ // Load initial dataset
67
+ await this.loadDataset();
68
+ },
69
+
70
+ setupKeyboardNavigation() {
71
+ document.addEventListener('keydown', (e) => {
72
+ // Ignore if user is typing in input
73
+ if (['INPUT', 'TEXTAREA', 'SELECT'].includes(e.target.tagName)) return;
74
+
75
+ switch(e.key) {
76
+ case 'ArrowLeft':
77
+ e.preventDefault();
78
+ if (e.shiftKey && this.showDock) {
79
+ this.scrollDockLeft();
80
+ } else {
81
+ this.previousSample();
82
+ }
83
+ break;
84
+ case 'ArrowRight':
85
+ e.preventDefault();
86
+ if (e.shiftKey && this.showDock) {
87
+ this.scrollDockRight();
88
+ } else {
89
+ this.nextSample();
90
+ }
91
+ break;
92
+ case 'k':
93
+ case 'K':
94
+ e.preventDefault();
95
+ this.previousSample();
96
+ break;
97
+ case 'j':
98
+ case 'J':
99
+ e.preventDefault();
100
+ this.nextSample();
101
+ break;
102
+ case '1':
103
+ this.activeTab = 'comparison';
104
+ break;
105
+ case '2':
106
+ this.activeTab = 'diff';
107
+ break;
108
+ case '3':
109
+ this.activeTab = 'improved';
110
+ break;
111
+ case 'v':
112
+ case 'V':
113
+ // Toggle dock with V key
114
+ if (this.showDock) {
115
+ this.hideDockPreview();
116
+ } else {
117
+ this.showDockPreview();
118
+ }
119
+ break;
120
+ }
121
+ });
122
+ },
123
+
124
+ async loadDataset() {
125
+ this.loading = true;
126
+ this.error = null;
127
+
128
+ try {
129
+ // Validate dataset
130
+ await this.api.validateDataset(this.datasetId);
131
+
132
+ // Get dataset info
133
+ const info = await this.api.getDatasetInfo(this.datasetId);
134
+ this.datasetConfig = info.defaultConfig;
135
+ this.datasetSplit = info.defaultSplit;
136
+
137
+ // Get total rows
138
+ this.totalSamples = await this.api.getTotalRows(
139
+ this.datasetId,
140
+ this.datasetConfig,
141
+ this.datasetSplit
142
+ );
143
+
144
+ // Load first sample
145
+ this.currentIndex = 0;
146
+ await this.loadSample(0);
147
+
148
+ } catch (error) {
149
+ this.error = error.message;
150
+ } finally {
151
+ this.loading = false;
152
+ }
153
+ },
154
+
155
+ async loadSample(index) {
156
+ try {
157
+ const data = await this.api.getRow(
158
+ this.datasetId,
159
+ this.datasetConfig,
160
+ this.datasetSplit,
161
+ index
162
+ );
163
+
164
+ this.currentSample = data.row;
165
+ this.currentIndex = index;
166
+
167
+ // Update diff when sample changes
168
+ this.updateDiff();
169
+
170
+ // Update URL without triggering navigation
171
+ const url = new URL(window.location);
172
+ url.searchParams.set('dataset', this.datasetId);
173
+ url.searchParams.set('index', index);
174
+ window.history.replaceState({}, '', url);
175
+
176
+ } catch (error) {
177
+ this.error = `Failed to load sample: ${error.message}`;
178
+ }
179
+ },
180
+
181
+ async nextSample() {
182
+ if (this.currentIndex < this.totalSamples - 1) {
183
+ await this.loadSample(this.currentIndex + 1);
184
+ }
185
+ },
186
+
187
+ async previousSample() {
188
+ if (this.currentIndex > 0) {
189
+ await this.loadSample(this.currentIndex - 1);
190
+ }
191
+ },
192
+
193
+ async jumpToSample() {
194
+ const pageNum = parseInt(this.jumpToPage, 10);
195
+ if (!isNaN(pageNum) && pageNum >= 1 && pageNum <= this.totalSamples) {
196
+ // Convert 1-based page number to 0-based index
197
+ await this.loadSample(pageNum - 1);
198
+ // Clear the input after jumping
199
+ this.jumpToPage = '';
200
+ } else {
201
+ // Invalid or out-of-range page number: reset the input without navigating
202
+ this.jumpToPage = '';
203
+ }
204
+ },
205
+
206
+ getOriginalText() {
207
+ if (!this.currentSample) return '';
208
+ const columns = this.api.detectColumns(null, this.currentSample);
209
+ return this.currentSample[columns.originalText] || 'No original text found';
210
+ },
211
+
212
+ getImprovedText() {
213
+ if (!this.currentSample) return '';
214
+ const columns = this.api.detectColumns(null, this.currentSample);
215
+ return this.currentSample[columns.improvedText] || 'No improved text found';
216
+ },
217
+
218
+ getImageData() {
219
+ if (!this.currentSample) return null;
220
+ const columns = this.api.detectColumns(null, this.currentSample);
221
+ return columns.image ? this.currentSample[columns.image] : null;
222
+ },
223
+
224
+ getImageSrc() {
225
+ const imageData = this.getImageData();
226
+ return imageData?.src || '';
227
+ },
228
+
229
+ getImageDimensions() {
230
+ const imageData = this.getImageData();
231
+ if (imageData?.width && imageData?.height) {
232
+ return `${imageData.width}×${imageData.height}`;
233
+ }
234
+ return null;
235
+ },
236
+
237
+ updateDiff() {
238
+ const original = this.getOriginalText();
239
+ const improved = this.getImprovedText();
240
+
241
+ // Calculate statistics
242
+ this.calculateStatistics(original, improved);
243
+
244
+ // Use diff utility based on mode
245
+ switch(this.diffMode) {
246
+ case 'char':
247
+ this.diffHtml = createCharacterDiff(original, improved);
248
+ break;
249
+ case 'word':
250
+ this.diffHtml = createWordDiff(original, improved);
251
+ break;
252
+ case 'line':
253
+ this.diffHtml = createLineDiff(original, improved);
254
+ break;
255
+ }
256
+ },
257
+
258
+ calculateStatistics(original, improved) {
259
+ // Calculate similarity
260
+ this.similarity = calculateSimilarity(original, improved);
261
+
262
+ // Character statistics
263
+ const charDiff = this.getCharacterDiffStats(original, improved);
264
+ this.charStats = charDiff;
265
+
266
+ // Word statistics
267
+ const originalWords = original.split(/\s+/).filter(w => w.length > 0);
268
+ const improvedWords = improved.split(/\s+/).filter(w => w.length > 0);
269
+ this.wordStats = {
270
+ original: originalWords.length,
271
+ improved: improvedWords.length
272
+ };
273
+ },
274
+
275
+ getCharacterDiffStats(original, improved) {
276
+ const dp = computeLCS(original, improved);
277
+ const diff = buildDiff(original, improved, dp);
278
+
279
+ let added = 0;
280
+ let removed = 0;
281
+ let unchanged = 0;
282
+
283
+ for (const part of diff) {
284
+ if (part.type === 'insert') {
285
+ added += part.value.length;
286
+ } else if (part.type === 'delete') {
287
+ removed += part.value.length;
288
+ } else {
289
+ unchanged += part.value.length;
290
+ }
291
+ }
292
+
293
+ return {
294
+ total: original.length,
295
+ added: added,
296
+ removed: removed,
297
+ unchanged: unchanged
298
+ };
299
+ },
300
+
301
+ async handleImageError(event) {
302
+ // Try to refresh the image URL
303
+ console.log('Image failed to load, refreshing URL...');
304
+ try {
305
+ const data = await this.api.refreshImageUrl(
306
+ this.datasetId,
307
+ this.datasetConfig,
308
+ this.datasetSplit,
309
+ this.currentIndex
310
+ );
311
+
312
+ // Update the image source
313
+ if (data.row && data.row[this.api.detectColumns(null, data.row).image]?.src) {
314
+ event.target.src = data.row[this.api.detectColumns(null, data.row).image].src;
315
+ }
316
+ } catch (error) {
317
+ console.error('Failed to refresh image URL:', error);
318
+ // Clear the source; css/styles.css hides images whose src is empty
319
+ event.target.src = '';
320
+ }
321
+ },
322
+
323
+ exportComparison() {
324
+ const original = this.getOriginalText();
325
+ const improved = this.getImprovedText();
326
+ const metadata = {
327
+ dataset: this.datasetId,
328
+ page: this.currentIndex + 1,
329
+ totalPages: this.totalSamples,
330
+ exportDate: new Date().toISOString(),
331
+ similarity: `${this.similarity}%`,
332
+ statistics: {
333
+ characters: this.charStats,
334
+ words: this.wordStats
335
+ }
336
+ };
337
+
338
+ // Create export content
339
+ let content = `OCR Text Comparison Export\n`;
340
+ content += `==========================\n\n`;
341
+ content += `Dataset: ${metadata.dataset}\n`;
342
+ content += `Page: ${metadata.page} of ${metadata.totalPages}\n`;
343
+ content += `Export Date: ${new Date().toLocaleString()}\n`;
344
+ content += `Similarity: ${metadata.similarity}\n`;
345
+ content += `Characters: ${metadata.statistics.characters.total} total, `;
346
+ content += `${metadata.statistics.characters.added} added, `;
347
+ content += `${metadata.statistics.characters.removed} removed\n`;
348
+ content += `Words: ${metadata.statistics.words.original} → ${metadata.statistics.words.improved}\n`;
349
+ content += `\n${'='.repeat(50)}\n\n`;
350
+ content += `ORIGINAL OCR:\n`;
351
+ content += `${'='.repeat(50)}\n`;
352
+ content += original;
353
+ content += `\n\n${'='.repeat(50)}\n\n`;
354
+ content += `IMPROVED OCR:\n`;
355
+ content += `${'='.repeat(50)}\n`;
356
+ content += improved;
357
+
358
+ // Download file
359
+ const blob = new Blob([content], { type: 'text/plain' });
360
+ const url = URL.createObjectURL(blob);
361
+ const a = document.createElement('a');
362
+ a.href = url;
363
+ a.download = `ocr-comparison-${this.datasetId.replace('/', '-')}-page-${this.currentIndex + 1}.txt`;
364
+ document.body.appendChild(a);
365
+ a.click();
366
+ document.body.removeChild(a);
367
+ URL.revokeObjectURL(url);
368
+ },
369
+
370
+ // Flow view methods
371
+ async toggleFlowView() {
372
+ this.showFlowView = !this.showFlowView;
373
+ if (this.showFlowView) {
374
+ // Reset to center around current page when opening
375
+ this.flowStartIndex = Math.max(0, this.currentIndex - Math.floor(this.flowVisibleCount / 2));
376
+ await this.loadFlowItems();
377
+ }
378
+ },
379
+
380
+ async loadFlowItems() {
381
+ // Load thumbnails from flowStartIndex
382
+ const startIdx = this.flowStartIndex;
383
+ this.flowItems = [];
384
+
385
+ // Load visible items
386
+ for (let i = 0; i < this.flowVisibleCount && (startIdx + i) < this.totalSamples; i++) {
387
+ const idx = startIdx + i;
388
+ try {
389
+ const data = await this.api.getRow(
390
+ this.datasetId,
391
+ this.datasetConfig,
392
+ this.datasetSplit,
393
+ idx
394
+ );
395
+
396
+ const columns = this.api.detectColumns(null, data.row);
397
+ const imageData = columns.image ? data.row[columns.image] : null;
398
+
399
+ this.flowItems.push({
400
+ index: idx,
401
+ imageSrc: imageData?.src || '',
402
+ row: data.row
403
+ });
404
+ } catch (error) {
405
+ console.error(`Failed to load flow item ${idx}:`, error);
406
+ }
407
+ }
408
+ },
409
+
410
+ scrollFlowLeft() {
411
+ if (this.flowStartIndex > 0) {
412
+ this.flowStartIndex = Math.max(0, this.flowStartIndex - this.flowVisibleCount);
413
+ this.loadFlowItems();
414
+ }
415
+ },
416
+
417
+ scrollFlowRight() {
418
+ if (this.flowStartIndex < this.totalSamples - this.flowVisibleCount) {
419
+ this.flowStartIndex = Math.min(
420
+ this.totalSamples - this.flowVisibleCount,
421
+ this.flowStartIndex + this.flowVisibleCount
422
+ );
423
+ this.loadFlowItems();
424
+ }
425
+ },
426
+
427
+ async jumpToFlowPage(index) {
428
+ this.showFlowView = false;
429
+ await this.loadSample(index);
430
+ },
431
+
432
+ async handleFlowImageError(event, index) {
433
+ // Try to refresh the image URL for flow item
434
+ try {
435
+ const data = await this.api.refreshImageUrl(
436
+ this.datasetId,
437
+ this.datasetConfig,
438
+ this.datasetSplit,
439
+ index
440
+ );
441
+
442
+ if (data.row) {
443
+ const columns = this.api.detectColumns(null, data.row);
444
+ const imageData = columns.image ? data.row[columns.image] : null;
445
+ if (imageData?.src) {
446
+ event.target.src = imageData.src;
447
+ // Update the flow item
448
+ const flowItem = this.flowItems.find(item => item.index === index);
449
+ if (flowItem) {
450
+ flowItem.imageSrc = imageData.src;
451
+ }
452
+ }
453
+ }
454
+ } catch (error) {
455
+ console.error('Failed to refresh flow image URL:', error);
456
+ }
457
+ },
458
+
459
+ // Dock methods
460
+ async showDockPreview() {
461
+ // Clear any hide timeout
462
+ if (this.dockHideTimeout) {
463
+ clearTimeout(this.dockHideTimeout);
464
+ this.dockHideTimeout = null;
465
+ }
466
+
467
+ this.showDock = true;
468
+
469
+ // Center dock around current page
470
+ this.dockStartIndex = Math.max(0,
471
+ Math.min(
472
+ this.currentIndex - Math.floor(this.dockVisibleCount / 2),
473
+ this.totalSamples - this.dockVisibleCount
474
+ )
475
+ );
476
+
477
+ // Always reload dock items to show current position
478
+ await this.loadDockItems();
479
+ },
480
+
481
+ hideDockPreview() {
482
+ // Add a small delay to prevent flickering
483
+ this.dockHideTimeout = setTimeout(() => {
484
+ this.showDock = false;
485
+ }, 300);
486
+ },
487
+
488
+ async loadDockItems() {
489
+ // Load thumbnails based on dock start index
490
+ const endIdx = Math.min(this.totalSamples, this.dockStartIndex + this.dockVisibleCount);
491
+
492
+ this.dockItems = [];
493
+
494
+ for (let i = this.dockStartIndex; i < endIdx; i++) {
495
+ try {
496
+ const data = await this.api.getRow(
497
+ this.datasetId,
498
+ this.datasetConfig,
499
+ this.datasetSplit,
500
+ i
501
+ );
502
+
503
+ const columns = this.api.detectColumns(null, data.row);
504
+ const imageData = columns.image ? data.row[columns.image] : null;
505
+
506
+ this.dockItems.push({
507
+ index: i,
508
+ imageSrc: imageData?.src || '',
509
+ row: data.row
510
+ });
511
+ } catch (error) {
512
+ console.error(`Failed to load dock item ${i}:`, error);
513
+ }
514
+ }
515
+ },
516
+
517
+ async scrollDockLeft() {
518
+ if (this.dockStartIndex > 0) {
519
+ this.dockStartIndex = Math.max(0, this.dockStartIndex - Math.floor(this.dockVisibleCount / 2));
520
+ await this.loadDockItems();
521
+ }
522
+ },
523
+
524
+ async scrollDockRight() {
525
+ if (this.dockStartIndex < this.totalSamples - this.dockVisibleCount) {
526
+ this.dockStartIndex = Math.min(
527
+ this.totalSamples - this.dockVisibleCount,
528
+ this.dockStartIndex + Math.floor(this.dockVisibleCount / 2)
529
+ );
530
+ await this.loadDockItems();
531
+ }
532
+ },
533
+
534
+ async jumpToDockPage(index) {
535
+ this.showDock = false;
536
+ await this.loadSample(index);
537
+ },
538
+
539
+ // Watch for diff mode changes
540
+ initWatchers() {
541
+ this.$watch('diffMode', () => this.updateDiff());
542
+ this.$watch('currentSample', () => this.updateDiff());
543
+ }
544
+ }));
545
+ });
546
+
547
+ // Initialize watchers after Alpine loads
548
+ document.addEventListener('alpine:initialized', () => {
549
+ Alpine.store('ocrExplorer')?.initWatchers?.();
550
+ });
js/dataset-api.js ADDED
@@ -0,0 +1,273 @@
1
+ /**
2
+ * HuggingFace Dataset Viewer API wrapper
3
+ * Handles fetching data from the datasets-server API with caching and error handling
4
+ */
5
+
6
+ class DatasetAPI {
7
+ constructor() {
8
+ this.baseURL = 'https://datasets-server.huggingface.co';
9
+ this.cache = new Map();
10
+ this.cacheExpiry = 45 * 60 * 1000; // 45 minutes (conservative for signed URLs)
11
+ this.rowsPerFetch = 100; // API maximum
12
+ }
13
+
14
+ /**
15
+ * Check if a dataset is valid and has viewer enabled
16
+ */
17
+ async validateDataset(datasetId) {
18
+ try {
19
+ const response = await fetch(`${this.baseURL}/is-valid?dataset=${encodeURIComponent(datasetId)}`);
20
+ if (!response.ok) {
21
+ throw new Error(`Failed to validate dataset: ${response.statusText}`);
22
+ }
23
+ const data = await response.json();
24
+
25
+ if (!data.viewer) {
26
+ throw new Error('Dataset viewer is not available for this dataset');
27
+ }
28
+
29
+ return true;
30
+ } catch (error) {
31
+ throw new Error(`Dataset validation failed: ${error.message}`);
32
+ }
33
+ }
34
+
+     /**
+      * Get dataset info including splits and configs
+      */
+     async getDatasetInfo(datasetId) {
+         const cacheKey = `info_${datasetId}`;
+         const cached = this.getFromCache(cacheKey);
+         if (cached) return cached;
+
+         try {
+             const response = await fetch(`${this.baseURL}/splits?dataset=${encodeURIComponent(datasetId)}`);
+             if (!response.ok) {
+                 throw new Error(`Failed to get dataset info: ${response.statusText}`);
+             }
+             const data = await response.json();
+             const splits = data.splits || [];
+
+             // Use the first listed config as the default, then pick a split that
+             // belongs to that config (preferring 'train') so the pair is always valid
+             const defaultConfig = splits[0]?.config || 'default';
+             const configSplits = splits.filter(s => s.config === defaultConfig);
+             const defaultSplit = configSplits.find(s => s.split === 'train')?.split || configSplits[0]?.split || 'train';
+
+             const info = {
+                 configs: [...new Set(splits.map(s => s.config))],
+                 splits: [...new Set(splits.map(s => s.split))],
+                 defaultConfig,
+                 defaultSplit,
+                 raw: data
+             };
+
+             this.setCache(cacheKey, info);
+             return info;
+         } catch (error) {
+             throw new Error(`Failed to get dataset info: ${error.message}`);
+         }
+     }
+
+     /**
+      * Get the total number of rows in a dataset split (null if it cannot be determined)
+      */
+     async getTotalRows(datasetId, config, split) {
+         const cacheKey = `size_${datasetId}_${config}_${split}`;
+         const cached = this.getFromCache(cacheKey);
+         if (cached !== null) return cached; // a cached size of 0 is still a valid hit
+
+         try {
+             // First try to get from the size endpoint
+             const sizeResponse = await fetch(
+                 `${this.baseURL}/size?dataset=${encodeURIComponent(datasetId)}&config=${encodeURIComponent(config)}&split=${encodeURIComponent(split)}`
+             );
+
+             if (sizeResponse.ok) {
+                 const sizeData = await sizeResponse.json();
+                 // Prefer the entry for the requested split; otherwise fall back to
+                 // num_rows in size.config or size.splits[0]
+                 const size = sizeData.size?.splits?.find(s => s.split === split)?.num_rows ||
+                     sizeData.size?.config?.num_rows ||
+                     sizeData.size?.splits?.[0]?.num_rows ||
+                     0;
+                 this.setCache(cacheKey, size);
+                 return size;
+             }
+
+             // Fallback: get first rows and check num_rows_total
+             const rowsResponse = await fetch(
+                 `${this.baseURL}/first-rows?dataset=${encodeURIComponent(datasetId)}&config=${encodeURIComponent(config)}&split=${encodeURIComponent(split)}`
+             );
+
+             if (!rowsResponse.ok) {
+                 throw new Error('Unable to determine dataset size');
+             }
+
+             const rowsData = await rowsResponse.json();
+             // If num_rows_total is absent, rows.length is only a lower bound on the split size
+             const size = rowsData.num_rows_total || rowsData.rows?.length || 0;
+             this.setCache(cacheKey, size);
+             return size;
+         } catch (error) {
+             console.warn('Failed to get total rows:', error);
+             return null;
+         }
+     }
+
+     /**
+      * Fetch rows from the dataset
+      */
+     async fetchRows(datasetId, config, split, offset, length = this.rowsPerFetch) {
+         const cacheKey = `rows_${datasetId}_${config}_${split}_${offset}_${length}`;
+         const cached = this.getFromCache(cacheKey);
+         if (cached) return cached;
+
+         try {
+             const response = await fetch(
+                 `${this.baseURL}/rows?dataset=${encodeURIComponent(datasetId)}&config=${encodeURIComponent(config)}&split=${encodeURIComponent(split)}&offset=${offset}&length=${length}`
+             );
+
+             if (!response.ok) {
+                 if (response.status === 403) {
+                     throw new Error('Access denied. This dataset may be private or gated.');
+                 }
+                 throw new Error(`Failed to fetch rows: ${response.statusText}`);
+             }
+
+             const data = await response.json();
+
+             // Extract column information
+             const columns = this.detectColumns(data.features, data.rows[0]?.row);
+
+             const result = {
+                 rows: data.rows,
+                 features: data.features,
+                 columns,
+                 numRowsTotal: data.num_rows_total,
+                 partial: data.partial || false
+             };
+
+             this.setCache(cacheKey, result);
+             return result;
+         } catch (error) {
+             throw new Error(`Failed to fetch rows: ${error.message}`);
+         }
+     }
+
+     /**
+      * Get a single row by index with smart batching
+      */
+     async getRow(datasetId, config, split, index) {
+         // Calculate which batch this index falls into
+         const batchStart = Math.floor(index / this.rowsPerFetch) * this.rowsPerFetch;
+         const batchData = await this.fetchRows(datasetId, config, split, batchStart, this.rowsPerFetch);
+
+         const localIndex = index - batchStart;
+         if (localIndex >= 0 && localIndex < batchData.rows.length) {
+             return {
+                 row: batchData.rows[localIndex].row,
+                 columns: batchData.columns,
+                 numRowsTotal: batchData.numRowsTotal
+             };
+         }
+
+         throw new Error(`Row ${index} not found`);
+     }
+
+     /**
+      * Detect column names for image and text data
+      */
+     detectColumns(features, sampleRow) {
+         // Candidate names are shared by both detection passes so the lists stay in sync
+         const originalCandidates = ['text', 'ocr', 'original_text', 'original', 'ground_truth'];
+         const improvedCandidates = ['markdown', 'new_ocr', 'corrected_text', 'improved', 'vlm_ocr', 'corrected'];
+
+         let imageColumn = null;
+         let originalTextColumn = null;
+         let improvedTextColumn = null;
+
+         // Try to detect from features first
+         for (const feature of features || []) {
+             const name = feature.name;
+             const type = feature.type || {}; // guard against features without type info
+
+             // Detect image column (first match wins, like the text columns)
+             if (!imageColumn && (type._type === 'Image' || type.dtype === 'image' || type.feature?._type === 'Image')) {
+                 imageColumn = name;
+             }
+
+             // Detect text columns based on common patterns
+             if (!originalTextColumn && originalCandidates.includes(name)) {
+                 originalTextColumn = name;
+             }
+
+             if (!improvedTextColumn && improvedCandidates.includes(name)) {
+                 improvedTextColumn = name;
+             }
+         }
+
+         // Fallback: detect from sample row
+         if (sampleRow) {
+             const keys = Object.keys(sampleRow);
+
+             if (!imageColumn) {
+                 for (const key of keys) {
+                     if (sampleRow[key]?.src && sampleRow[key]?.height !== undefined) {
+                         imageColumn = key;
+                         break;
+                     }
+                 }
+             }
+
+             // Additional text column detection from row data
+             if (!originalTextColumn) {
+                 originalTextColumn = keys.find(k => originalCandidates.includes(k)) || null;
+             }
+
+             if (!improvedTextColumn) {
+                 improvedTextColumn = keys.find(k => improvedCandidates.includes(k)) || null;
+             }
+         }
+
+         return {
+             image: imageColumn,
+             originalText: originalTextColumn,
+             improvedText: improvedTextColumn
+         };
+     }
+
+     /**
+      * Refresh expired image URL by re-fetching the row
+      */
+     async refreshImageUrl(datasetId, config, split, index) {
+         // Clear cache for this specific row batch
+         const batchStart = Math.floor(index / this.rowsPerFetch) * this.rowsPerFetch;
+         const cacheKey = `rows_${datasetId}_${config}_${split}_${batchStart}_${this.rowsPerFetch}`;
+         this.cache.delete(cacheKey);
+
+         // Re-fetch the row
+         return await this.getRow(datasetId, config, split, index);
+     }
+
+     /**
+      * Cache management utilities
+      */
+     getFromCache(key) {
+         const cached = this.cache.get(key);
+         if (!cached) return null;
+
+         if (Date.now() - cached.timestamp > this.cacheExpiry) {
+             this.cache.delete(key);
+             return null;
+         }
+
+         return cached.data;
+     }
+
+     setCache(key, data) {
+         this.cache.set(key, {
+             data,
+             timestamp: Date.now()
+         });
+     }
+
+     clearCache() {
+         this.cache.clear();
+     }
+ }
+
+ // Export for use in other scripts
+ window.DatasetAPI = DatasetAPI;
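The batching in `getRow` and the cache eviction in `refreshImageUrl` both depend on the same offset arithmetic and cache-key format. The following is a minimal standalone sketch of that arithmetic, assuming the 100-row batch size set in the constructor. `batchStartFor` and `rowsCacheKey` are hypothetical helper names used only for illustration (they are not part of the committed file), and `user/dataset` is a placeholder dataset ID.

```javascript
// Hypothetical helpers mirroring the arithmetic in getRow / refreshImageUrl.
const ROWS_PER_FETCH = 100; // assumed to match this.rowsPerFetch in DatasetAPI

// Offset of the batch that contains a given row index.
function batchStartFor(index, rowsPerFetch = ROWS_PER_FETCH) {
    return Math.floor(index / rowsPerFetch) * rowsPerFetch;
}

// Same key format fetchRows uses, so a refresh can evict exactly one batch.
function rowsCacheKey(datasetId, config, split, index, rowsPerFetch = ROWS_PER_FETCH) {
    const offset = batchStartFor(index, rowsPerFetch);
    return `rows_${datasetId}_${config}_${split}_${offset}_${rowsPerFetch}`;
}

console.log(batchStartFor(0));   // 0
console.log(batchStartFor(99));  // 0
console.log(batchStartFor(250)); // 200
console.log(rowsCacheKey('user/dataset', 'default', 'train', 250));
// rows_user/dataset_default_train_200_100
```

Because every row in a batch maps to the same key, deleting that one key is enough to force fresh signed image URLs for all rows in the batch.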
js/diff-utils.js ADDED
@@ -0,0 +1,219 @@
+ /**
+  * Text comparison utilities for OCR Text Explorer
+  * Provides character, word, and line-level diff visualization
+  */
+
+ /**
+  * Create character-level diff with inline highlighting
+  */
+ function createCharacterDiff(original, improved) {
+     if (!original || !improved) {
+         return '<p class="text-gray-500">No text to compare</p>';
+     }
+
+     const dp = computeLCS(original, improved);
+     const diff = buildDiff(original, improved, dp);
+
+     let html = '<div class="font-mono text-sm whitespace-pre-wrap text-gray-900 dark:text-gray-100">';
+
+     for (const part of diff) {
+         if (part.type === 'equal') {
+             html += escapeHtml(part.value);
+         } else if (part.type === 'delete') {
+             html += `<span class="bg-red-200 dark:bg-red-950 text-red-800 dark:text-red-300 line-through">${escapeHtml(part.value)}</span>`;
+         } else if (part.type === 'insert') {
+             html += `<span class="bg-green-200 dark:bg-green-950 text-green-800 dark:text-green-300">${escapeHtml(part.value)}</span>`;
+         }
+     }
+
+     html += '</div>';
+     return html;
+ }
+
+ /**
+  * Create word-level diff
+  */
+ function createWordDiff(original, improved) {
+     if (!original || !improved) {
+         return '<p class="text-gray-500">No text to compare</p>';
+     }
+
+     // Split into words while preserving whitespace
+     const originalWords = splitIntoWords(original);
+     const improvedWords = splitIntoWords(improved);
+
+     const dp = computeLCS(originalWords, improvedWords);
+     const diff = buildDiff(originalWords, improvedWords, dp);
+
+     let html = '<div class="font-mono text-sm whitespace-pre-wrap text-gray-900 dark:text-gray-100">';
+
+     for (const part of diff) {
+         if (part.type === 'equal') {
+             html += escapeHtml(part.value.join(''));
+         } else if (part.type === 'delete') {
+             html += `<span class="bg-red-200 dark:bg-red-950 text-red-800 dark:text-red-300 line-through">${escapeHtml(part.value.join(''))}</span>`;
+         } else if (part.type === 'insert') {
+             html += `<span class="bg-green-200 dark:bg-green-950 text-green-800 dark:text-green-300">${escapeHtml(part.value.join(''))}</span>`;
+         }
+     }
+
+     html += '</div>';
+     return html;
+ }
+
+ /**
+  * Create line-level diff
+  */
+ function createLineDiff(original, improved) {
+     if (!original || !improved) {
+         return '<p class="text-gray-500">No text to compare</p>';
+     }
+
+     const originalLines = original.split('\n');
+     const improvedLines = improved.split('\n');
+
+     const dp = computeLCS(originalLines, improvedLines);
+     const diff = buildDiff(originalLines, improvedLines, dp);
+
+     let html = '<div class="font-mono text-sm text-gray-900 dark:text-gray-100">';
+
+     for (const part of diff) {
+         if (part.type === 'equal') {
+             for (const line of part.value) {
+                 html += `<div class="py-1">${escapeHtml(line)}</div>`;
+             }
+         } else if (part.type === 'delete') {
+             for (const line of part.value) {
+                 html += `<div class="py-1 bg-red-200 dark:bg-red-950 text-red-800 dark:text-red-300 line-through">${escapeHtml(line)}</div>`;
+             }
+         } else if (part.type === 'insert') {
+             for (const line of part.value) {
+                 html += `<div class="py-1 bg-green-200 dark:bg-green-950 text-green-800 dark:text-green-300">${escapeHtml(line)}</div>`;
+             }
+         }
+     }
+
+     html += '</div>';
+     return html;
+ }
+
+ /**
+  * Compute Longest Common Subsequence using dynamic programming
+  * Accepts strings (characters) or arrays (words, lines); dp[i][j] holds the LCS
+  * length of the first i elements of a and the first j elements of b.
+  * Time and memory are O(m * n), which gets expensive for very long texts.
+  */
+ function computeLCS(a, b) {
+     const m = a.length;
+     const n = b.length;
+     const dp = Array(m + 1).fill(null).map(() => Array(n + 1).fill(0));
+
+     for (let i = 1; i <= m; i++) {
+         for (let j = 1; j <= n; j++) {
+             if (a[i - 1] === b[j - 1]) {
+                 dp[i][j] = dp[i - 1][j - 1] + 1;
+             } else {
+                 dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
+             }
+         }
+     }
+
+     return dp;
+ }
+
+ /**
+  * Build diff from LCS table
+  * Walks the table backwards from the bottom-right corner, so parts are collected
+  * in reverse order and flipped at the end.
+  */
+ function buildDiff(a, b, dp) {
+     const diff = [];
+     let i = a.length;
+     let j = b.length;
+
+     while (i > 0 || j > 0) {
+         if (i > 0 && j > 0 && a[i - 1] === b[j - 1]) {
+             // Elements are equal
+             if (diff.length > 0 && diff[diff.length - 1].type === 'equal') {
+                 diff[diff.length - 1].value.unshift(a[i - 1]);
+             } else {
+                 diff.push({ type: 'equal', value: [a[i - 1]] });
+             }
+             i--;
+             j--;
+         } else if (j > 0 && (i === 0 || dp[i][j - 1] >= dp[i - 1][j])) {
+             // Element in b but not in a (insertion)
+             if (diff.length > 0 && diff[diff.length - 1].type === 'insert') {
+                 diff[diff.length - 1].value.unshift(b[j - 1]);
+             } else {
+                 diff.push({ type: 'insert', value: [b[j - 1]] });
+             }
+             j--;
+         } else {
+             // Element in a but not in b (deletion)
+             if (diff.length > 0 && diff[diff.length - 1].type === 'delete') {
+                 diff[diff.length - 1].value.unshift(a[i - 1]);
+             } else {
+                 diff.push({ type: 'delete', value: [a[i - 1]] });
+             }
+             i--;
+         }
+     }
+
+     diff.reverse();
+
+     // Convert arrays to strings for character diff
+     if (typeof a === 'string') {
+         diff.forEach(part => {
+             part.value = part.value.join('');
+         });
+     }
+
+     return diff;
+ }
+
+ /**
+  * Split text into words while preserving whitespace
+  * Each whitespace character becomes its own token, so joining the result
+  * reproduces the input exactly.
+  */
+ function splitIntoWords(text) {
+     const words = [];
+     let current = '';
+
+     for (const char of text) {
+         if (/\s/.test(char)) {
+             if (current) {
+                 words.push(current);
+                 current = '';
+             }
+             words.push(char);
+         } else {
+             current += char;
+         }
+     }
+
+     if (current) {
+         words.push(current);
+     }
+
+     return words;
+ }
+
+ /**
+  * Escape HTML special characters
+  */
+ function escapeHtml(text) {
+     const div = document.createElement('div');
+     div.textContent = text;
+     return div.innerHTML;
+ }
+
+ /**
+  * Calculate similarity percentage between two texts
+  */
+ function calculateSimilarity(original, improved) {
+     if (!original || !improved) return 0;
+
+     const dp = computeLCS(original, improved);
+     const lcsLength = dp[original.length][improved.length];
+     const maxLength = Math.max(original.length, improved.length);
+
+     return Math.round((lcsLength / maxLength) * 100);
+ }
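The similarity score above is the LCS length divided by the longer input's length, rounded to a whole percentage. As a worked illustration, here is a DOM-free sketch that can be run under Node. `lcsTable` and `similarityPercent` are hypothetical stand-ins that follow the same logic as `computeLCS` and `calculateSimilarity`, and the sample strings are made up for the example.

```javascript
// Hypothetical stand-in for computeLCS: dp[i][j] is the LCS length of a[0..i) and b[0..j).
function lcsTable(a, b) {
    const dp = Array.from({ length: a.length + 1 }, () => Array(b.length + 1).fill(0));
    for (let i = 1; i <= a.length; i++) {
        for (let j = 1; j <= b.length; j++) {
            dp[i][j] = a[i - 1] === b[j - 1]
                ? dp[i - 1][j - 1] + 1
                : Math.max(dp[i - 1][j], dp[i][j - 1]);
        }
    }
    return dp;
}

// Hypothetical stand-in for calculateSimilarity.
function similarityPercent(original, improved) {
    if (!original || !improved) return 0;
    const lcsLength = lcsTable(original, improved)[original.length][improved.length];
    const maxLength = Math.max(original.length, improved.length);
    return Math.round((lcsLength / maxLength) * 100);
}

// "tbe cat" vs "the cat": the LCS is "te cat" (6 of 7 characters).
console.log(similarityPercent('tbe cat', 'the cat')); // 86
console.log(similarityPercent('same', 'same'));       // 100
console.log(similarityPercent('', 'text'));           // 0
```

Note that a single substituted character costs one LCS position, so short texts swing by large percentages while long pages barely move.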