Commit: c49cb47
Parent(s): b9aef14
Configure OCR Time Capsule with default dataset and branding
- CLAUDE.md +186 -0
- css/styles.css +197 -0
- js/app.js +550 -0
- js/dataset-api.js +273 -0
- js/diff-utils.js +219 -0
CLAUDE.md
ADDED
@@ -0,0 +1,186 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with the OCR Text Explorer.

## Project Overview

OCR Text Explorer is a modern, standalone web application for browsing and comparing OCR text improvements in HuggingFace datasets. Built as a lightweight alternative to the Gradio-based OCR Time Machine, it focuses specifically on exploring pre-OCR'd datasets with enhanced user experience.

## Architecture

### Technology Stack
- **Frontend Framework**: Alpine.js (lightweight reactivity, ~15KB)
- **Styling**: Tailwind CSS (utility-first, responsive design)
- **Interactions**: HTMX (server-side rendering capabilities)
- **API**: HuggingFace Dataset Viewer API (no backend required)
- **Language**: Vanilla JavaScript (no build process needed)

### Core Components

**index.html** - Main application shell
- Split-pane layout (1/3 image, 2/3 text comparison)
- Three view modes: Side-by-side, Inline diff, Improved only
- Dark mode support with proper contrast
- Responsive design for mobile devices

**js/dataset-api.js** - HuggingFace API wrapper
- Smart caching with 45-minute expiration for signed URLs
- Batch loading (100 rows at a time)
- Automatic column detection for different dataset schemas
- Image URL refresh on expiration

**js/app.js** - Alpine.js application logic
- Keyboard navigation (J/K, arrows)
- URL state management for shareable links
- Diff mode switching (character/word/line)
- Dark mode persistence in localStorage
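
The URL state management mentioned above can be sketched as a pair of pure helpers. This is a minimal sketch, not the code in app.js: only the `dataset` and `index` query parameter names come from the source, while `buildShareUrl`, `parseShareUrl`, and the fall-back-to-zero behavior are assumptions made for illustration.

```javascript
// Hypothetical helpers; only the `dataset` and `index` parameter names
// are taken from app.js.
function buildShareUrl(baseHref, datasetId, index) {
  const url = new URL(baseHref);
  url.searchParams.set('dataset', datasetId);
  url.searchParams.set('index', String(index));
  return url.toString();
}

function parseShareUrl(href, totalSamples) {
  const params = new URL(href).searchParams;
  const datasetId = params.get('dataset');
  const raw = Number.parseInt(params.get('index') ?? '', 10);
  // Assumption: fall back to the first sample when the index is
  // missing, malformed, or out of range.
  const inRange = Number.isInteger(raw) && raw >= 0 && raw < totalSamples;
  return { datasetId, index: inRange ? raw : 0 };
}
```

In a browser, `window.location.href` would be the natural input for `parseShareUrl`, and the result of `buildShareUrl` could be passed to `history.replaceState` so that navigation does not add history entries.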

**js/diff-utils.js** - Text comparison algorithms
- Character-level diff with inline highlighting
- Word-level diff preserving whitespace
- Line-level diff for larger changes
- LCS (Longest Common Subsequence) implementation
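
To make the LCS-based approach concrete, here is a minimal character-level sketch. It is not the contents of diff-utils.js: the names `lcsTable` and `backtrackDiff` and the `'equal'` type label are hypothetical, and only the `{ type, value }` shape with `'insert'` and `'delete'` types mirrors what app.js consumes.

```javascript
// Build the classic dynamic-programming table where dp[i][j] is the
// LCS length of a.slice(0, i) and b.slice(0, j).
function lcsTable(a, b) {
  const dp = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = a[i - 1] === b[j - 1]
        ? dp[i - 1][j - 1] + 1
        : Math.max(dp[i - 1][j], dp[i][j - 1]);
    }
  }
  return dp;
}

// Walk the table backwards to recover the edit script, then merge
// adjacent operations of the same type into longer parts.
function backtrackDiff(a, b, dp) {
  const ops = [];
  let i = a.length;
  let j = b.length;
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 && a[i - 1] === b[j - 1]) {
      ops.push({ type: 'equal', value: a[i - 1] });
      i--;
      j--;
    } else if (j > 0 && (i === 0 || dp[i][j - 1] >= dp[i - 1][j])) {
      ops.push({ type: 'insert', value: b[j - 1] });
      j--;
    } else {
      ops.push({ type: 'delete', value: a[i - 1] });
      i--;
    }
  }
  ops.reverse();

  const parts = [];
  for (const op of ops) {
    const last = parts[parts.length - 1];
    if (last && last.type === op.type) {
      last.value += op.value;
    } else {
      parts.push({ ...op });
    }
  }
  return parts;
}
```

The table needs memory proportional to the product of the two text lengths, which is one reason word-level or line-level modes can be preferable for long pages. The same two functions work for those modes if `a` and `b` are arrays of tokens and the merge step joins tokens instead of concatenating characters.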

**css/styles.css** - Custom styling
- Dark mode enhancements
- Diff highlighting with accessibility in mind
- Smooth transitions and animations
- Print-friendly styles

## Key Design Decisions

### Why Separate from OCR Time Machine?

1. **Focused Purpose**: OCR Time Machine is for live OCR processing with VLMs (requires GPU), while this explorer is for browsing pre-processed results
2. **Performance**: No Python/Gradio overhead - instant loading and navigation
3. **User Experience**: Custom UI optimized for text comparison workflows
4. **Deployment**: Static files can be hosted anywhere (GitHub Pages, CDN, etc.)

### API vs Backend Trade-offs

**Chose HF Dataset Viewer API because:**
- No backend infrastructure needed
- Automatic image serving with CDN
- Built-in pagination support
- Works with any public HF dataset

**Limitations accepted:**
- Image URLs expire (~1 hour)
- 100 rows max per request
- No write capabilities
- Public datasets only (no auth yet)
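
The 100-row limit means a sample index has to be mapped to a batch offset before the API is called. A minimal sketch, assuming the public `/rows` endpoint of the Dataset Viewer API with `dataset`, `config`, `split`, `offset`, and `length` query parameters; the helper names `batchOffsetFor` and `rowsUrl` are hypothetical and not taken from dataset-api.js.

```javascript
const ROWS_ENDPOINT = 'https://datasets-server.huggingface.co/rows';
const MAX_ROWS_PER_REQUEST = 100; // limit stated in the list above

// Start offset of the batch that contains the given 0-based row index.
function batchOffsetFor(index, batchSize = MAX_ROWS_PER_REQUEST) {
  return Math.floor(index / batchSize) * batchSize;
}

// Build the request URL for the batch containing `index`.
function rowsUrl(dataset, config, split, index) {
  const url = new URL(ROWS_ENDPOINT);
  url.searchParams.set('dataset', dataset);
  url.searchParams.set('config', config);
  url.searchParams.set('split', split);
  url.searchParams.set('offset', String(batchOffsetFor(index)));
  url.searchParams.set('length', String(MAX_ROWS_PER_REQUEST));
  return url.toString();
}
```

With this mapping, rows 0 to 99 share one request, rows 100 to 199 the next, and so on, so stepping through neighbouring samples usually hits an already fetched batch.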

### UI/UX Principles

1. **Keyboard-first**: Professional users prefer keyboard navigation
2. **Information density**: Show more content, less chrome
3. **Visual diff**: Color-coded changes are easier to scan than side-by-side
4. **Dark mode**: Essential for extended reading sessions
5. **Responsive**: Works on tablets for field work

## Development Approach

### Phase 1: MVP (Completed)
- Basic dataset loading and navigation
- Side-by-side text comparison
- Keyboard shortcuts
- Dark mode

### Phase 2: Enhancements (Completed)
- Three diff algorithms (char/word/line)
- URL state management
- Image error handling with refresh
- Responsive mobile layout

### Phase 3: Polish (Completed)
- Fixed dark mode contrast issues
- Optimized performance with direct indexing
- Added loading states and error handling
- Comprehensive documentation

## Common Tasks

### Adding Column Name Patterns
```javascript
// In dataset-api.js detectColumns() method
if (!originalTextColumn && ['your_column_name'].includes(name)) {
    originalTextColumn = name;
}
```

### Adding Keyboard Shortcuts
```javascript
// In app.js setupKeyboardNavigation()
case 'your_key':
    // Your action
    break;
```

### Customizing Diff Colors
```javascript
// In diff-utils.js
// Light mode: bg-red-200, text-red-800
// Dark mode: bg-red-950, text-red-300
```
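
As an illustration of where such colors end up, the sketch below turns a list of diff parts into HTML. It is an assumption-laden example rather than the code in diff-utils.js: `escapeHtml` and `renderDiffHtml` are hypothetical names, the `{ type, value }` part shape is inferred from app.js, and the sketch uses the `.diff-delete` and `.diff-insert` classes from css/styles.css instead of inline Tailwind utilities.

```javascript
// Escape text before embedding it in markup so OCR output containing
// angle brackets or ampersands cannot break the page.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Wrap inserted and deleted parts in spans; unchanged text is emitted as-is.
function renderDiffHtml(parts) {
  return parts.map(part => {
    const safe = escapeHtml(part.value);
    if (part.type === 'delete') return `<span class="diff-delete">${safe}</span>`;
    if (part.type === 'insert') return `<span class="diff-insert">${safe}</span>`;
    return safe;
  }).join('');
}
```

Keeping the colors behind two class names means a palette change only touches the stylesheet, at the cost of one extra level of indirection compared with writing the utility classes directly into the generated markup.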

## Performance Optimizations

1. **Direct Dataset Indexing**: Uses `dataset[index]` instead of loading batches into memory
2. **Smart Caching**: Caches API responses for 45 minutes (conservative for signed URLs)
3. **Batch Fetching**: Loads 100 rows at once, caches for smooth navigation
4. **Lazy Loading**: Only fetches data when needed
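
The 45-minute caching rule can be sketched as a small time-to-live cache. This is a minimal sketch under stated assumptions: the class name `TtlCache`, its methods, and the injectable clock are hypothetical, and only the 45-minute lifetime comes from the text above.

```javascript
const SIGNED_URL_TTL_MS = 45 * 60 * 1000; // 45 minutes, as stated above

class TtlCache {
  // `now` is injectable so expiry can be exercised without waiting.
  constructor(ttlMs = SIGNED_URL_TTL_MS, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
    this.entries = new Map();
  }

  set(key, value) {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  // Returns undefined for unknown or expired keys and evicts expired ones.
  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }
}
```

A batch response could be stored under a key such as `dataset/config/split/offset`; once the entry expires, the next lookup misses and triggers a fresh request, which also yields fresh signed image URLs.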

## Known Issues & Solutions

### Issue: Navigation buttons were disabled
**Cause**: API response structure wasn't parsed correctly
**Fix**: Updated getTotalRows() to check `size.config.num_rows` and `size.splits[0].num_rows`
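
The fix above amounts to probing two places in the response for a row count. A minimal sketch of that lookup, assuming a response object with an optional `size` field; `extractTotalRows` is a hypothetical name, and returning `null` when neither field is present is an assumption rather than the documented behavior of getTotalRows().

```javascript
// Look for the row count in `size.config.num_rows` first, then in
// `size.splits[0].num_rows`, tolerating missing fields at every level.
function extractTotalRows(response) {
  const size = response?.size;
  const fromConfig = size?.config?.num_rows;
  if (Number.isInteger(fromConfig)) return fromConfig;
  const fromSplit = size?.splits?.[0]?.num_rows;
  if (Number.isInteger(fromSplit)) return fromSplit;
  return null; // assumption: the caller decides how to handle an unknown total
}
```

Returning an explicit `null` keeps "unknown total" distinct from an empty dataset with zero rows, which matters for deciding whether navigation buttons should be enabled.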

### Issue: Dark mode text unreadable
**Cause**: Insufficient contrast in diff highlighting and code blocks
**Fix**:
- Changed diff colors to use `dark:bg-red-950` and `dark:text-red-300`
- Added explicit `text-gray-900 dark:text-gray-100` to all text containers

### Issue: Image loading errors
**Cause**: Signed URLs expire after ~1 hour
**Fix**: Implemented handleImageError() with automatic URL refresh

## Future Enhancements

- [ ] Search/filter within dataset
- [ ] Bookmark favorite samples
- [ ] Export selected texts
- [ ] Support for private datasets (auth)
- [ ] Metrics display (CER/WER)
- [ ] Batch operations
- [ ] PWA support for offline viewing

## Deployment

### Static Hosting (Recommended)
```bash
# Any static file server works
python3 -m http.server 8000
npx serve .
```

### GitHub Pages
1. Push to GitHub repository
2. Enable Pages in settings
3. Access at: `https://[username].github.io/[repo]/ocr-text-explorer/`

### CDN Deployment
- Upload files to any CDN
- No server-side processing needed
- Works with CloudFlare, Netlify, Vercel, etc.

## Testing Datasets

Known working datasets:
- `davanstrien/exams-ocr` - Default dataset with great examples
- Any dataset with image + text columns

Column patterns automatically detected:
- Original: `text`, `ocr`, `original_text`, `ground_truth`
- Improved: `markdown`, `new_ocr`, `corrected_text`, `vlm_ocr`
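
The pattern lists above suggest a simple name-based lookup. The sketch below is one way it could work, not the detectColumns() method in dataset-api.js: the function name `detectColumnsSketch` is hypothetical, the returned keys mirror the `originalText`, `improvedText`, and `image` fields that app.js reads, and treating any object value with a string `src` as the image column is an assumption.

```javascript
const ORIGINAL_PATTERNS = ['text', 'ocr', 'original_text', 'ground_truth'];
const IMPROVED_PATTERNS = ['markdown', 'new_ocr', 'corrected_text', 'vlm_ocr'];

// Inspect one row and report which column plays which role.
// The first matching column wins for each role.
function detectColumnsSketch(row) {
  const result = { originalText: null, improvedText: null, image: null };
  for (const name of Object.keys(row)) {
    const value = row[name];
    const looksLikeImage =
      value !== null && typeof value === 'object' && typeof value.src === 'string';
    if (!result.image && looksLikeImage) {
      result.image = name;
    } else if (!result.originalText && ORIGINAL_PATTERNS.includes(name)) {
      result.originalText = name;
    } else if (!result.improvedText && IMPROVED_PATTERNS.includes(name)) {
      result.improvedText = name;
    }
  }
  return result;
}
```

A row shaped like `{ image: { src: '...' }, text: '...', markdown: '...' }` would map to `image`, `text`, and `markdown` respectively, while unrecognized columns are ignored.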
css/styles.css
ADDED
@@ -0,0 +1,197 @@
/**
 * Custom styles for OCR Text Explorer
 * Extends Tailwind CSS with specific styling needs
 */

/* Custom scrollbar styling */
::-webkit-scrollbar {
    width: 8px;
    height: 8px;
}

::-webkit-scrollbar-track {
    @apply bg-gray-100 dark:bg-gray-800;
}

::-webkit-scrollbar-thumb {
    @apply bg-gray-400 dark:bg-gray-600 rounded;
}

::-webkit-scrollbar-thumb:hover {
    @apply bg-gray-500 dark:bg-gray-500;
}

/* Firefox scrollbar */
* {
    scrollbar-width: thin;
    scrollbar-color: theme('colors.gray.400') theme('colors.gray.100');
}

.dark * {
    scrollbar-color: theme('colors.gray.600') theme('colors.gray.800');
}

/* Smooth transitions for theme switching */
body {
    transition: background-color 0.3s ease, color 0.3s ease;
}

/* Image panel sticky positioning adjustment */
.sticky {
    position: -webkit-sticky;
    position: sticky;
}

/* Diff content styling */
.diff-content {
    line-height: 1.6;
    word-break: break-word;
}

/* Keyboard hint styling */
kbd {
    @apply inline-block px-2 py-1 text-xs font-semibold text-gray-800 bg-gray-100 border border-gray-300 rounded dark:bg-gray-700 dark:text-gray-200 dark:border-gray-600;
    box-shadow: 0 1px 0 rgba(0, 0, 0, 0.1);
}

/* Loading spinner animation (in case Tailwind's animate-spin needs adjustment) */
@keyframes spin {
    to {
        transform: rotate(360deg);
    }
}

.animate-spin {
    animation: spin 1s linear infinite;
}

/* Tab hover effect */
nav button {
    position: relative;
    transition: color 0.2s ease;
}

nav button::after {
    content: '';
    position: absolute;
    bottom: -2px;
    left: 0;
    right: 0;
    height: 2px;
    background-color: transparent;
    transition: background-color 0.2s ease;
}

nav button:hover::after {
    @apply bg-gray-300 dark:bg-gray-600;
}

/* Image loading state */
img {
    @apply bg-gray-200 dark:bg-gray-700;
    min-height: 200px;
}

img[src=""] {
    visibility: hidden;
}

/* Print styles */
@media print {
    header, footer {
        display: none !important;
    }

    .no-print {
        display: none !important;
    }

    main {
        height: auto !important;
    }

    .diff-content {
        page-break-inside: avoid;
    }
}

/* Responsive adjustments */
@media (max-width: 768px) {
    /* Stack panels vertically on mobile */
    main.flex {
        @apply flex-col;
    }

    /* Full width for panels on mobile */
    main > div:first-child {
        @apply w-full max-h-96;
    }

    /* Adjust text size */
    .prose-sm {
        @apply text-xs;
    }

    /* Hide keyboard hints on mobile */
    footer .text-sm:last-child {
        @apply hidden;
    }
}

/* Focus styles for accessibility */
button:focus, input:focus, select:focus {
    @apply outline-none ring-2 ring-blue-500 ring-offset-2 dark:ring-offset-gray-900;
}

/* Custom tooltip styles (if needed later) */
.tooltip {
    @apply invisible absolute z-10 px-2 py-1 text-xs text-white bg-gray-900 rounded shadow-lg dark:bg-gray-700;
}

.tooltip-trigger:hover .tooltip {
    @apply visible;
}

/* Preserve whitespace in diff views */
.whitespace-pre-wrap {
    white-space: pre-wrap;
    word-wrap: break-word;
}

/* Enhanced diff highlighting with better dark mode contrast */
.diff-delete {
    @apply bg-red-200 dark:bg-red-950 text-red-800 dark:text-red-300;
    text-decoration: line-through;
    text-decoration-color: currentColor;
    text-decoration-thickness: 2px;
}

.diff-insert {
    @apply bg-green-200 dark:bg-green-950 text-green-800 dark:text-green-300;
    position: relative;
}

/* Dark mode specific improvements */
.dark .prose {
    @apply text-gray-200;
}

.dark .prose h3 {
    @apply text-gray-100;
}

/* Remove this - handled inline with classes
.dark pre {
    @apply bg-gray-800 text-gray-200;
} */

/* Line numbers for future enhancement */
.line-numbers {
    counter-reset: line;
}

.line-numbers > div::before {
    counter-increment: line;
    content: counter(line);
    @apply inline-block w-12 mr-4 text-right text-gray-400 dark:text-gray-600 select-none;
}
js/app.js
ADDED
@@ -0,0 +1,550 @@
/**
 * Main Alpine.js application for OCR Text Explorer
 */

document.addEventListener('alpine:init', () => {
    Alpine.data('ocrExplorer', () => ({
        // Dataset state
        datasetId: 'davanstrien/exams-ocr',
        datasetConfig: 'default',
        datasetSplit: 'train',

        // Navigation state
        currentIndex: 0,
        totalSamples: null,
        currentSample: null,
        jumpToPage: '',

        // UI state
        loading: false,
        error: null,
        activeTab: 'comparison',
        diffMode: 'char',
        darkMode: false,
        showAbout: false,
        showFlowView: false,
        showDock: false,

        // Flow view state
        flowItems: [],
        flowStartIndex: 0,
        flowVisibleCount: 7,
        flowOffset: 0,

        // Dock state
        dockItems: [],
        dockHideTimeout: null,
        dockStartIndex: 0,
        dockVisibleCount: 10,

        // Computed diff HTML
        diffHtml: '',

        // Statistics
        similarity: 0,
        charStats: { total: 0, added: 0, removed: 0 },
        wordStats: { original: 0, improved: 0 },

        // API instance
        api: null,

        async init() {
            // Initialize API
            this.api = new DatasetAPI();

            // Apply dark mode from localStorage
            this.darkMode = localStorage.getItem('darkMode') === 'true';
            this.$watch('darkMode', value => {
                localStorage.setItem('darkMode', value);
                document.documentElement.classList.toggle('dark', value);
            });
            document.documentElement.classList.toggle('dark', this.darkMode);

            // Setup keyboard navigation
            this.setupKeyboardNavigation();

            // Load initial dataset
            await this.loadDataset();
        },

        setupKeyboardNavigation() {
            document.addEventListener('keydown', (e) => {
                // Ignore if user is typing in input
                if (e.target.tagName === 'INPUT') return;

                switch(e.key) {
                    case 'ArrowLeft':
                        e.preventDefault();
                        if (e.shiftKey && this.showDock) {
                            this.scrollDockLeft();
                        } else {
                            this.previousSample();
                        }
                        break;
                    case 'ArrowRight':
                        e.preventDefault();
                        if (e.shiftKey && this.showDock) {
                            this.scrollDockRight();
                        } else {
                            this.nextSample();
                        }
                        break;
                    case 'k':
                    case 'K':
                        e.preventDefault();
                        this.previousSample();
                        break;
                    case 'j':
                    case 'J':
                        e.preventDefault();
                        this.nextSample();
                        break;
                    case '1':
                        this.activeTab = 'comparison';
                        break;
                    case '2':
                        this.activeTab = 'diff';
                        break;
                    case '3':
                        this.activeTab = 'improved';
                        break;
                    case 'v':
                    case 'V':
                        // Toggle dock with V key
                        if (this.showDock) {
                            this.hideDockPreview();
                        } else {
                            this.showDockPreview();
                        }
                        break;
                }
            });
        },

        async loadDataset() {
            this.loading = true;
            this.error = null;

            try {
                // Validate dataset
                await this.api.validateDataset(this.datasetId);

                // Get dataset info
                const info = await this.api.getDatasetInfo(this.datasetId);
                this.datasetConfig = info.defaultConfig;
                this.datasetSplit = info.defaultSplit;

                // Get total rows
                this.totalSamples = await this.api.getTotalRows(
                    this.datasetId,
                    this.datasetConfig,
                    this.datasetSplit
                );

                // Load first sample
                this.currentIndex = 0;
                await this.loadSample(0);

            } catch (error) {
                this.error = error.message;
            } finally {
                this.loading = false;
            }
        },

        async loadSample(index) {
            try {
                const data = await this.api.getRow(
                    this.datasetId,
                    this.datasetConfig,
                    this.datasetSplit,
                    index
                );

                this.currentSample = data.row;
                this.currentIndex = index;

                // Update diff when sample changes
                this.updateDiff();

                // Update URL without triggering navigation
                const url = new URL(window.location);
                url.searchParams.set('dataset', this.datasetId);
                url.searchParams.set('index', index);
                window.history.replaceState({}, '', url);

            } catch (error) {
                this.error = `Failed to load sample: ${error.message}`;
            }
        },

        async nextSample() {
            if (this.currentIndex < this.totalSamples - 1) {
                await this.loadSample(this.currentIndex + 1);
            }
        },

        async previousSample() {
            if (this.currentIndex > 0) {
                await this.loadSample(this.currentIndex - 1);
            }
        },

        async jumpToSample() {
            const pageNum = parseInt(this.jumpToPage);
            if (!isNaN(pageNum) && pageNum >= 1 && pageNum <= this.totalSamples) {
                // Convert 1-based page number to 0-based index
                await this.loadSample(pageNum - 1);
                // Clear the input after jumping
                this.jumpToPage = '';
            } else {
                // Show error or just reset
                this.jumpToPage = '';
            }
        },

        getOriginalText() {
            if (!this.currentSample) return '';
            const columns = this.api.detectColumns(null, this.currentSample);
            return this.currentSample[columns.originalText] || 'No original text found';
        },

        getImprovedText() {
            if (!this.currentSample) return '';
            const columns = this.api.detectColumns(null, this.currentSample);
            return this.currentSample[columns.improvedText] || 'No improved text found';
        },

        getImageData() {
            if (!this.currentSample) return null;
            const columns = this.api.detectColumns(null, this.currentSample);
            return columns.image ? this.currentSample[columns.image] : null;
        },

        getImageSrc() {
            const imageData = this.getImageData();
            return imageData?.src || '';
        },

        getImageDimensions() {
            const imageData = this.getImageData();
            if (imageData?.width && imageData?.height) {
                return `${imageData.width}×${imageData.height}`;
            }
            return null;
        },

        updateDiff() {
            const original = this.getOriginalText();
            const improved = this.getImprovedText();

            // Calculate statistics
            this.calculateStatistics(original, improved);

            // Use diff utility based on mode
            switch(this.diffMode) {
                case 'char':
                    this.diffHtml = createCharacterDiff(original, improved);
                    break;
                case 'word':
                    this.diffHtml = createWordDiff(original, improved);
                    break;
                case 'line':
                    this.diffHtml = createLineDiff(original, improved);
                    break;
            }
        },

        calculateStatistics(original, improved) {
            // Calculate similarity
            this.similarity = calculateSimilarity(original, improved);

            // Character statistics
            const charDiff = this.getCharacterDiffStats(original, improved);
            this.charStats = charDiff;

            // Word statistics
            const originalWords = original.split(/\s+/).filter(w => w.length > 0);
            const improvedWords = improved.split(/\s+/).filter(w => w.length > 0);
            this.wordStats = {
                original: originalWords.length,
                improved: improvedWords.length
            };
        },

        getCharacterDiffStats(original, improved) {
            const dp = computeLCS(original, improved);
            const diff = buildDiff(original, improved, dp);

            let added = 0;
            let removed = 0;
            let unchanged = 0;

            for (const part of diff) {
                if (part.type === 'insert') {
                    added += part.value.length;
                } else if (part.type === 'delete') {
                    removed += part.value.length;
                } else {
                    unchanged += part.value.length;
                }
            }

            return {
                total: original.length,
                added: added,
                removed: removed,
                unchanged: unchanged
            };
        },

        async handleImageError(event) {
            // Try to refresh the image URL
            console.log('Image failed to load, refreshing URL...');
            try {
                const data = await this.api.refreshImageUrl(
                    this.datasetId,
                    this.datasetConfig,
                    this.datasetSplit,
                    this.currentIndex
                );

                // Update the image source
                if (data.row && data.row[this.api.detectColumns(null, data.row).image]?.src) {
                    event.target.src = data.row[this.api.detectColumns(null, data.row).image].src;
                }
            } catch (error) {
                console.error('Failed to refresh image URL:', error);
                // Set a placeholder image
                event.target.src = '';
            }
        },

        exportComparison() {
            const original = this.getOriginalText();
            const improved = this.getImprovedText();
            const metadata = {
                dataset: this.datasetId,
                page: this.currentIndex + 1,
                totalPages: this.totalSamples,
                exportDate: new Date().toISOString(),
                similarity: `${this.similarity}%`,
                statistics: {
                    characters: this.charStats,
                    words: this.wordStats
                }
            };

            // Create export content
            let content = `OCR Text Comparison Export\n`;
            content += `==========================\n\n`;
            content += `Dataset: ${metadata.dataset}\n`;
            content += `Page: ${metadata.page} of ${metadata.totalPages}\n`;
            content += `Export Date: ${new Date().toLocaleString()}\n`;
            content += `Similarity: ${metadata.similarity}\n`;
            content += `Characters: ${metadata.statistics.characters.total} total, `;
            content += `${metadata.statistics.characters.added} added, `;
            content += `${metadata.statistics.characters.removed} removed\n`;
            content += `Words: ${metadata.statistics.words.original} → ${metadata.statistics.words.improved}\n`;
            content += `\n${'='.repeat(50)}\n\n`;
            content += `ORIGINAL OCR:\n`;
            content += `${'='.repeat(50)}\n`;
            content += original;
            content += `\n\n${'='.repeat(50)}\n\n`;
            content += `IMPROVED OCR:\n`;
            content += `${'='.repeat(50)}\n`;
            content += improved;

            // Download file
            const blob = new Blob([content], { type: 'text/plain' });
            const url = URL.createObjectURL(blob);
            const a = document.createElement('a');
            a.href = url;
            a.download = `ocr-comparison-${this.datasetId.replace('/', '-')}-page-${this.currentIndex + 1}.txt`;
            document.body.appendChild(a);
            a.click();
            document.body.removeChild(a);
            URL.revokeObjectURL(url);
        },

        // Flow view methods
        async toggleFlowView() {
            this.showFlowView = !this.showFlowView;
            if (this.showFlowView) {
                // Reset to center around current page when opening
                this.flowStartIndex = Math.max(0, this.currentIndex - Math.floor(this.flowVisibleCount / 2));
                await this.loadFlowItems();
            }
        },

        async loadFlowItems() {
            // Load thumbnails from flowStartIndex
            const startIdx = this.flowStartIndex;
            this.flowItems = [];

            // Load visible items
            for (let i = 0; i < this.flowVisibleCount && (startIdx + i) < this.totalSamples; i++) {
                const idx = startIdx + i;
                try {
                    const data = await this.api.getRow(
                        this.datasetId,
                        this.datasetConfig,
                        this.datasetSplit,
                        idx
                    );

                    const columns = this.api.detectColumns(null, data.row);
                    const imageData = columns.image ? data.row[columns.image] : null;

                    this.flowItems.push({
                        index: idx,
                        imageSrc: imageData?.src || '',
                        row: data.row
                    });
                } catch (error) {
                    console.error(`Failed to load flow item ${idx}:`, error);
                }
            }
        },

        scrollFlowLeft() {
            if (this.flowStartIndex > 0) {
                this.flowStartIndex = Math.max(0, this.flowStartIndex - this.flowVisibleCount);
                this.loadFlowItems();
            }
        },

        scrollFlowRight() {
            if (this.flowStartIndex < this.totalSamples - this.flowVisibleCount) {
                this.flowStartIndex = Math.min(
                    this.totalSamples - this.flowVisibleCount,
                    this.flowStartIndex + this.flowVisibleCount
                );
                this.loadFlowItems();
            }
        },

        async jumpToFlowPage(index) {
            this.showFlowView = false;
            await this.loadSample(index);
        },

        async handleFlowImageError(event, index) {
            // Try to refresh the image URL for flow item
            try {
                const data = await this.api.refreshImageUrl(
                    this.datasetId,
                    this.datasetConfig,
                    this.datasetSplit,
                    index
                );

                if (data.row) {
                    const columns = this.api.detectColumns(null, data.row);
                    const imageData = columns.image ? data.row[columns.image] : null;
                    if (imageData?.src) {
                        event.target.src = imageData.src;
                        // Update the flow item
                        const flowItem = this.flowItems.find(item => item.index === index);
                        if (flowItem) {
                            flowItem.imageSrc = imageData.src;
                        }
                    }
                }
            } catch (error) {
                console.error('Failed to refresh flow image URL:', error);
            }
        },

        // Dock methods
        async showDockPreview() {
            // Clear any hide timeout
            if (this.dockHideTimeout) {
                clearTimeout(this.dockHideTimeout);
                this.dockHideTimeout = null;
            }

            this.showDock = true;

            // Center dock around current page
            this.dockStartIndex = Math.max(0,
                Math.min(
                    this.currentIndex - Math.floor(this.dockVisibleCount / 2),
                    this.totalSamples - this.dockVisibleCount
                )
            );

            // Always reload dock items to show current position
            await this.loadDockItems();
        },

        hideDockPreview() {
            // Add a small delay to prevent flickering
            this.dockHideTimeout = setTimeout(() => {
                this.showDock = false;
            }, 300);
        },

        async loadDockItems() {
            // Load thumbnails based on dock start index
            const endIdx = Math.min(this.totalSamples, this.dockStartIndex + this.dockVisibleCount);

            this.dockItems = [];

            for (let i = this.dockStartIndex; i < endIdx; i++) {
                try {
                    const data = await this.api.getRow(
                        this.datasetId,
                        this.datasetConfig,
                        this.datasetSplit,
                        i
                    );

                    const columns = this.api.detectColumns(null, data.row);
                    const imageData = columns.image ? data.row[columns.image] : null;

                    this.dockItems.push({
                        index: i,
                        imageSrc: imageData?.src || '',
                        row: data.row
                    });
                } catch (error) {
                    console.error(`Failed to load dock item ${i}:`, error);
                }
            }
        },

        async scrollDockLeft() {
            if (this.dockStartIndex > 0) {
                this.dockStartIndex = Math.max(0, this.dockStartIndex - Math.floor(this.dockVisibleCount / 2));
|
520 |
+
await this.loadDockItems();
|
521 |
+
}
|
522 |
+
},
|
523 |
+
|
524 |
+
async scrollDockRight() {
|
525 |
+
if (this.dockStartIndex < this.totalSamples - this.dockVisibleCount) {
|
526 |
+
this.dockStartIndex = Math.min(
|
527 |
+
this.totalSamples - this.dockVisibleCount,
|
528 |
+
this.dockStartIndex + Math.floor(this.dockVisibleCount / 2)
|
529 |
+
);
|
530 |
+
await this.loadDockItems();
|
531 |
+
}
|
532 |
+
},
|
533 |
+
|
534 |
+
async jumpToDockPage(index) {
|
535 |
+
this.showDock = false;
|
536 |
+
await this.loadSample(index);
|
537 |
+
},
|
538 |
+
|
539 |
+
// Watch for diff mode changes
|
540 |
+
initWatchers() {
|
541 |
+
this.$watch('diffMode', () => this.updateDiff());
|
542 |
+
this.$watch('currentSample', () => this.updateDiff());
|
543 |
+
}
|
544 |
+
}));
|
545 |
+
});
|
546 |
+
|
547 |
+
// Initialize watchers after Alpine loads (a silent no-op unless a store named 'ocrExplorer' is registered)
|
548 |
+
document.addEventListener('alpine:initialized', () => {
|
549 |
+
Alpine.store('ocrExplorer')?.initWatchers?.();
|
550 |
+
});
|
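The dock and flow methods above all clamp a window of thumbnails to the valid index range. A minimal sketch of that arithmetic as pure functions follows. The names `centeredWindowStart` and `shiftedWindowStart` are hypothetical and do not exist in `js/app.js`; they only isolate the `Math.max`/`Math.min` logic so it can be checked without Alpine or the API.

```javascript
// Hypothetical helpers (not part of js/app.js) that isolate the window
// arithmetic used by showDockPreview() and the scroll*Left/Right methods.

// Start index of a window of `visible` items centered on `current`,
// clamped so the window stays inside [0, total).
function centeredWindowStart(current, visible, total) {
  const centered = current - Math.floor(visible / 2);
  return Math.max(0, Math.min(centered, total - visible));
}

// Start index after moving the window by `step` items (negative = left).
function shiftedWindowStart(start, step, visible, total) {
  const lastStart = Math.max(0, total - visible);
  return Math.max(0, Math.min(lastStart, start + step));
}

console.log(centeredWindowStart(50, 9, 100));   // 46
console.log(centeredWindowStart(2, 9, 100));    // 0
console.log(centeredWindowStart(98, 9, 100));   // 91
console.log(shiftedWindowStart(90, 9, 9, 100)); // 91
```

The outer `Math.max(0, ...)` also covers datasets smaller than the window, where `total - visible` is negative.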
js/dataset-api.js
ADDED
@@ -0,0 +1,273 @@
|
1 |
+
/**
|
2 |
+
* HuggingFace Dataset Viewer API wrapper
|
3 |
+
* Handles fetching data from the datasets-server API with caching and error handling
|
4 |
+
*/
|
5 |
+
|
6 |
+
class DatasetAPI {
|
7 |
+
constructor() {
|
8 |
+
this.baseURL = 'https://datasets-server.huggingface.co';
|
9 |
+
this.cache = new Map();
|
10 |
+
this.cacheExpiry = 45 * 60 * 1000; // 45 minutes (conservative for signed URLs)
|
11 |
+
this.rowsPerFetch = 100; // API maximum
|
12 |
+
}
|
13 |
+
|
14 |
+
/**
|
15 |
+
* Check if a dataset is valid and has viewer enabled
|
16 |
+
*/
|
17 |
+
async validateDataset(datasetId) {
|
18 |
+
try {
|
19 |
+
const response = await fetch(`${this.baseURL}/is-valid?dataset=${encodeURIComponent(datasetId)}`);
|
20 |
+
if (!response.ok) {
|
21 |
+
throw new Error(`Failed to validate dataset: ${response.statusText}`);
|
22 |
+
}
|
23 |
+
const data = await response.json();
|
24 |
+
|
25 |
+
if (!data.viewer) {
|
26 |
+
throw new Error('Dataset viewer is not available for this dataset');
|
27 |
+
}
|
28 |
+
|
29 |
+
return true;
|
30 |
+
} catch (error) {
|
31 |
+
throw new Error(`Dataset validation failed: ${error.message}`);
|
32 |
+
}
|
33 |
+
}
|
34 |
+
|
35 |
+
/**
|
36 |
+
* Get dataset info including splits and configs
|
37 |
+
*/
|
38 |
+
async getDatasetInfo(datasetId) {
|
39 |
+
const cacheKey = `info_${datasetId}`;
|
40 |
+
const cached = this.getFromCache(cacheKey);
|
41 |
+
if (cached) return cached;
|
42 |
+
|
43 |
+
try {
|
44 |
+
const response = await fetch(`${this.baseURL}/splits?dataset=${encodeURIComponent(datasetId)}`);
|
45 |
+
if (!response.ok) {
|
46 |
+
throw new Error(`Failed to get dataset info: ${response.statusText}`);
|
47 |
+
}
|
48 |
+
const data = await response.json();
|
49 |
+
|
50 |
+
// Extract the default config and split
|
51 |
+
const defaultConfig = data.splits[0]?.config || 'default';
|
52 |
+
const defaultSplit = data.splits.find(s => s.split === 'train')?.split || data.splits[0]?.split || 'train';
|
53 |
+
|
54 |
+
const info = {
|
55 |
+
configs: [...new Set(data.splits.map(s => s.config))],
|
56 |
+
splits: [...new Set(data.splits.map(s => s.split))],
|
57 |
+
defaultConfig,
|
58 |
+
defaultSplit,
|
59 |
+
raw: data
|
60 |
+
};
|
61 |
+
|
62 |
+
this.setCache(cacheKey, info);
|
63 |
+
return info;
|
64 |
+
} catch (error) {
|
65 |
+
throw new Error(`Failed to get dataset info: ${error.message}`);
|
66 |
+
}
|
67 |
+
}
|
68 |
+
|
69 |
+
/**
|
70 |
+
* Get the total number of rows in a dataset
|
71 |
+
*/
|
72 |
+
async getTotalRows(datasetId, config, split) {
|
73 |
+
const cacheKey = `size_${datasetId}_${config}_${split}`;
|
74 |
+
const cached = this.getFromCache(cacheKey);
|
75 |
+
if (cached !== null) return cached; // a cached size of 0 is still a valid hit
|
76 |
+
|
77 |
+
try {
|
78 |
+
// First try to get from the size endpoint
|
79 |
+
const sizeResponse = await fetch(
|
80 |
+
`${this.baseURL}/size?dataset=${encodeURIComponent(datasetId)}&config=${encodeURIComponent(config)}&split=${encodeURIComponent(split)}`
|
81 |
+
);
|
82 |
+
|
83 |
+
if (sizeResponse.ok) {
|
84 |
+
const sizeData = await sizeResponse.json();
|
85 |
+
// Prefer the row count of the requested split; fall back to the config-level total
|
86 |
+
const size = sizeData.size?.splits?.find(s => s.split === split)?.num_rows ??
|
87 |
+
sizeData.size?.config?.num_rows ??
|
88 |
+
0;
|
89 |
+
this.setCache(cacheKey, size);
|
90 |
+
return size;
|
91 |
+
}
|
92 |
+
|
93 |
+
// Fallback: get first rows and check num_rows_total
|
94 |
+
const rowsResponse = await fetch(
|
95 |
+
`${this.baseURL}/first-rows?dataset=${encodeURIComponent(datasetId)}&config=${encodeURIComponent(config)}&split=${encodeURIComponent(split)}`
|
96 |
+
);
|
97 |
+
|
98 |
+
if (!rowsResponse.ok) {
|
99 |
+
throw new Error('Unable to determine dataset size');
|
100 |
+
}
|
101 |
+
|
102 |
+
const rowsData = await rowsResponse.json();
|
103 |
+
const size = rowsData.num_rows_total || rowsData.rows?.length || 0;
|
104 |
+
this.setCache(cacheKey, size);
|
105 |
+
return size;
|
106 |
+
} catch (error) {
|
107 |
+
console.warn('Failed to get total rows:', error);
|
108 |
+
return null;
|
109 |
+
}
|
110 |
+
}
|
111 |
+
|
112 |
+
/**
|
113 |
+
* Fetch rows from the dataset
|
114 |
+
*/
|
115 |
+
async fetchRows(datasetId, config, split, offset, length = this.rowsPerFetch) {
|
116 |
+
const cacheKey = `rows_${datasetId}_${config}_${split}_${offset}_${length}`;
|
117 |
+
const cached = this.getFromCache(cacheKey);
|
118 |
+
if (cached) return cached;
|
119 |
+
|
120 |
+
try {
|
121 |
+
const response = await fetch(
|
122 |
+
`${this.baseURL}/rows?dataset=${encodeURIComponent(datasetId)}&config=${encodeURIComponent(config)}&split=${encodeURIComponent(split)}&offset=${offset}&length=${length}`
|
123 |
+
);
|
124 |
+
|
125 |
+
if (!response.ok) {
|
126 |
+
if (response.status === 403) {
|
127 |
+
throw new Error('Access denied. This dataset may be private or gated.');
|
128 |
+
}
|
129 |
+
throw new Error(`Failed to fetch rows: ${response.statusText}`);
|
130 |
+
}
|
131 |
+
|
132 |
+
const data = await response.json();
|
133 |
+
|
134 |
+
// Extract column information
|
135 |
+
const columns = this.detectColumns(data.features, data.rows[0]?.row);
|
136 |
+
|
137 |
+
const result = {
|
138 |
+
rows: data.rows,
|
139 |
+
features: data.features,
|
140 |
+
columns,
|
141 |
+
numRowsTotal: data.num_rows_total,
|
142 |
+
partial: data.partial || false
|
143 |
+
};
|
144 |
+
|
145 |
+
this.setCache(cacheKey, result);
|
146 |
+
return result;
|
147 |
+
} catch (error) {
|
148 |
+
throw new Error(`Failed to fetch rows: ${error.message}`);
|
149 |
+
}
|
150 |
+
}
|
151 |
+
|
152 |
+
/**
|
153 |
+
* Get a single row by index with smart batching
|
154 |
+
*/
|
155 |
+
async getRow(datasetId, config, split, index) {
|
156 |
+
// Calculate which batch this index falls into
|
157 |
+
const batchStart = Math.floor(index / this.rowsPerFetch) * this.rowsPerFetch;
|
158 |
+
const batchData = await this.fetchRows(datasetId, config, split, batchStart, this.rowsPerFetch);
|
159 |
+
|
160 |
+
const localIndex = index - batchStart;
|
161 |
+
if (localIndex >= 0 && localIndex < batchData.rows.length) {
|
162 |
+
return {
|
163 |
+
row: batchData.rows[localIndex].row,
|
164 |
+
columns: batchData.columns,
|
165 |
+
numRowsTotal: batchData.numRowsTotal
|
166 |
+
};
|
167 |
+
}
|
168 |
+
|
169 |
+
throw new Error(`Row ${index} not found`);
|
170 |
+
}
|
171 |
+
|
172 |
+
/**
|
173 |
+
* Detect column names for image and text data
|
174 |
+
*/
|
175 |
+
detectColumns(features, sampleRow) {
|
176 |
+
let imageColumn = null;
|
177 |
+
let originalTextColumn = null;
|
178 |
+
let improvedTextColumn = null;
|
179 |
+
|
180 |
+
// Try to detect from features first
|
181 |
+
for (const feature of features || []) {
|
182 |
+
const name = feature.name;
|
183 |
+
const type = feature.type || {}; // guard against features without type info
|
184 |
+
|
185 |
+
// Detect image column
|
186 |
+
if (type._type === 'Image' || type.dtype === 'image' || type.feature?._type === 'Image') {
|
187 |
+
imageColumn = name;
|
188 |
+
}
|
189 |
+
|
190 |
+
// Detect text columns based on common patterns
|
191 |
+
if (!originalTextColumn && ['text', 'ocr', 'original_text', 'original', 'ground_truth'].includes(name)) {
|
192 |
+
originalTextColumn = name;
|
193 |
+
}
|
194 |
+
|
195 |
+
if (!improvedTextColumn && ['markdown', 'new_ocr', 'corrected_text', 'improved', 'vlm_ocr', 'corrected'].includes(name)) {
|
196 |
+
improvedTextColumn = name;
|
197 |
+
}
|
198 |
+
}
|
199 |
+
|
200 |
+
// Fallback: detect from sample row
|
201 |
+
if (sampleRow) {
|
202 |
+
const keys = Object.keys(sampleRow);
|
203 |
+
|
204 |
+
if (!imageColumn) {
|
205 |
+
for (const key of keys) {
|
206 |
+
if (sampleRow[key]?.src && sampleRow[key]?.height !== undefined) {
|
207 |
+
imageColumn = key;
|
208 |
+
break;
|
209 |
+
}
|
210 |
+
}
|
211 |
+
}
|
212 |
+
|
213 |
+
// Additional text column detection from row data
|
214 |
+
if (!originalTextColumn) {
|
215 |
+
const candidates = ['text', 'ocr', 'original_text', 'original'];
|
216 |
+
originalTextColumn = keys.find(k => candidates.includes(k)) || null;
|
217 |
+
}
|
218 |
+
|
219 |
+
if (!improvedTextColumn) {
|
220 |
+
const candidates = ['markdown', 'new_ocr', 'corrected_text', 'improved'];
|
221 |
+
improvedTextColumn = keys.find(k => candidates.includes(k)) || null;
|
222 |
+
}
|
223 |
+
}
|
224 |
+
|
225 |
+
return {
|
226 |
+
image: imageColumn,
|
227 |
+
originalText: originalTextColumn,
|
228 |
+
improvedText: improvedTextColumn
|
229 |
+
};
|
230 |
+
}
|
231 |
+
|
232 |
+
/**
|
233 |
+
* Refresh expired image URL by re-fetching the row
|
234 |
+
*/
|
235 |
+
async refreshImageUrl(datasetId, config, split, index) {
|
236 |
+
// Clear cache for this specific row batch
|
237 |
+
const batchStart = Math.floor(index / this.rowsPerFetch) * this.rowsPerFetch;
|
238 |
+
const cacheKey = `rows_${datasetId}_${config}_${split}_${batchStart}_${this.rowsPerFetch}`;
|
239 |
+
this.cache.delete(cacheKey);
|
240 |
+
|
241 |
+
// Re-fetch the row
|
242 |
+
return await this.getRow(datasetId, config, split, index);
|
243 |
+
}
|
244 |
+
|
245 |
+
/**
|
246 |
+
* Cache management utilities
|
247 |
+
*/
|
248 |
+
getFromCache(key) {
|
249 |
+
const cached = this.cache.get(key);
|
250 |
+
if (!cached) return null;
|
251 |
+
|
252 |
+
if (Date.now() - cached.timestamp > this.cacheExpiry) {
|
253 |
+
this.cache.delete(key);
|
254 |
+
return null;
|
255 |
+
}
|
256 |
+
|
257 |
+
return cached.data;
|
258 |
+
}
|
259 |
+
|
260 |
+
setCache(key, data) {
|
261 |
+
this.cache.set(key, {
|
262 |
+
data,
|
263 |
+
timestamp: Date.now()
|
264 |
+
});
|
265 |
+
}
|
266 |
+
|
267 |
+
clearCache() {
|
268 |
+
this.cache.clear();
|
269 |
+
}
|
270 |
+
}
|
271 |
+
|
272 |
+
// Export for use in other scripts
|
273 |
+
window.DatasetAPI = DatasetAPI;
|
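The caching and batching ideas in `DatasetAPI` can be sketched in isolation. The following is a minimal illustration, not the file's implementation: `TtlCache` and `batchStartFor` are hypothetical names, and the injectable clock is an assumption added only so expiry can be exercised without waiting 45 minutes.

```javascript
// Minimal sketch of the expiry check in getFromCache() and the batch
// arithmetic in getRow(). TtlCache and batchStartFor are hypothetical names.

class TtlCache {
  // `now` is injectable so expiry can be tested without real delays.
  constructor(ttlMs, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
    this.entries = new Map();
  }

  set(key, data) {
    this.entries.set(key, { data, timestamp: this.now() });
  }

  // Returns undefined on a miss, so falsy values such as 0 can be
  // stored and still be told apart from "not cached".
  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() - entry.timestamp > this.ttlMs) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.data;
  }
}

// Offset of the fixed-size batch that contains `index`.
function batchStartFor(index, batchSize) {
  return Math.floor(index / batchSize) * batchSize;
}

let clock = 0;
const cache = new TtlCache(45 * 60 * 1000, () => clock);
cache.set('size_demo', 0);
console.log(cache.get('size_demo')); // 0 (a hit, even though the value is falsy)
clock += 46 * 60 * 1000;
console.log(cache.get('size_demo')); // undefined (expired)
console.log(batchStartFor(250, 100)); // 200
```

Aligning every single-row lookup to a batch boundary means neighbouring indices share one cache key, which is why `refreshImageUrl` only has to delete one entry to force fresh signed URLs.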
js/diff-utils.js
ADDED
@@ -0,0 +1,219 @@
|
1 |
+
/**
|
2 |
+
* Text comparison utilities for OCR Text Explorer
|
3 |
+
* Provides character, word, and line-level diff visualization
|
4 |
+
*/
|
5 |
+
|
6 |
+
/**
|
7 |
+
* Create character-level diff with inline highlighting
|
8 |
+
*/
|
9 |
+
function createCharacterDiff(original, improved) {
|
10 |
+
if (!original || !improved) {
|
11 |
+
return '<p class="text-gray-500">No text to compare</p>';
|
12 |
+
}
|
13 |
+
|
14 |
+
const dp = computeLCS(original, improved);
|
15 |
+
const diff = buildDiff(original, improved, dp);
|
16 |
+
|
17 |
+
let html = '<div class="font-mono text-sm whitespace-pre-wrap text-gray-900 dark:text-gray-100">';
|
18 |
+
|
19 |
+
for (const part of diff) {
|
20 |
+
if (part.type === 'equal') {
|
21 |
+
html += escapeHtml(part.value);
|
22 |
+
} else if (part.type === 'delete') {
|
23 |
+
html += `<span class="bg-red-200 dark:bg-red-950 text-red-800 dark:text-red-300 line-through">${escapeHtml(part.value)}</span>`;
|
24 |
+
} else if (part.type === 'insert') {
|
25 |
+
html += `<span class="bg-green-200 dark:bg-green-950 text-green-800 dark:text-green-300">${escapeHtml(part.value)}</span>`;
|
26 |
+
}
|
27 |
+
}
|
28 |
+
|
29 |
+
html += '</div>';
|
30 |
+
return html;
|
31 |
+
}
|
32 |
+
|
33 |
+
/**
|
34 |
+
* Create word-level diff
|
35 |
+
*/
|
36 |
+
function createWordDiff(original, improved) {
|
37 |
+
if (!original || !improved) {
|
38 |
+
return '<p class="text-gray-500">No text to compare</p>';
|
39 |
+
}
|
40 |
+
|
41 |
+
// Split into words while preserving whitespace
|
42 |
+
const originalWords = splitIntoWords(original);
|
43 |
+
const improvedWords = splitIntoWords(improved);
|
44 |
+
|
45 |
+
const dp = computeLCS(originalWords, improvedWords);
|
46 |
+
const diff = buildDiff(originalWords, improvedWords, dp);
|
47 |
+
|
48 |
+
let html = '<div class="font-mono text-sm whitespace-pre-wrap text-gray-900 dark:text-gray-100">';
|
49 |
+
|
50 |
+
for (const part of diff) {
|
51 |
+
if (part.type === 'equal') {
|
52 |
+
html += escapeHtml(part.value.join(''));
|
53 |
+
} else if (part.type === 'delete') {
|
54 |
+
html += `<span class="bg-red-200 dark:bg-red-950 text-red-800 dark:text-red-300 line-through">${escapeHtml(part.value.join(''))}</span>`;
|
55 |
+
} else if (part.type === 'insert') {
|
56 |
+
html += `<span class="bg-green-200 dark:bg-green-950 text-green-800 dark:text-green-300">${escapeHtml(part.value.join(''))}</span>`;
|
57 |
+
}
|
58 |
+
}
|
59 |
+
|
60 |
+
html += '</div>';
|
61 |
+
return html;
|
62 |
+
}
|
63 |
+
|
64 |
+
/**
|
65 |
+
* Create line-level diff
|
66 |
+
*/
|
67 |
+
function createLineDiff(original, improved) {
|
68 |
+
if (!original || !improved) {
|
69 |
+
return '<p class="text-gray-500">No text to compare</p>';
|
70 |
+
}
|
71 |
+
|
72 |
+
const originalLines = original.split('\n');
|
73 |
+
const improvedLines = improved.split('\n');
|
74 |
+
|
75 |
+
const dp = computeLCS(originalLines, improvedLines);
|
76 |
+
const diff = buildDiff(originalLines, improvedLines, dp);
|
77 |
+
|
78 |
+
let html = '<div class="font-mono text-sm text-gray-900 dark:text-gray-100">';
|
79 |
+
|
80 |
+
for (const part of diff) {
|
81 |
+
if (part.type === 'equal') {
|
82 |
+
for (const line of part.value) {
|
83 |
+
html += `<div class="py-1">${escapeHtml(line)}</div>`;
|
84 |
+
}
|
85 |
+
} else if (part.type === 'delete') {
|
86 |
+
for (const line of part.value) {
|
87 |
+
html += `<div class="py-1 bg-red-200 dark:bg-red-950 text-red-800 dark:text-red-300 line-through">${escapeHtml(line)}</div>`;
|
88 |
+
}
|
89 |
+
} else if (part.type === 'insert') {
|
90 |
+
for (const line of part.value) {
|
91 |
+
html += `<div class="py-1 bg-green-200 dark:bg-green-950 text-green-800 dark:text-green-300">${escapeHtml(line)}</div>`;
|
92 |
+
}
|
93 |
+
}
|
94 |
+
}
|
95 |
+
|
96 |
+
html += '</div>';
|
97 |
+
return html;
|
98 |
+
}
|
99 |
+
|
100 |
+
/**
|
101 |
+
* Compute Longest Common Subsequence using dynamic programming
|
102 |
+
*/
|
103 |
+
function computeLCS(a, b) {
|
104 |
+
const m = a.length;
|
105 |
+
const n = b.length;
|
106 |
+
const dp = Array(m + 1).fill(null).map(() => Array(n + 1).fill(0));
|
107 |
+
|
108 |
+
for (let i = 1; i <= m; i++) {
|
109 |
+
for (let j = 1; j <= n; j++) {
|
110 |
+
if (a[i - 1] === b[j - 1]) {
|
111 |
+
dp[i][j] = dp[i - 1][j - 1] + 1;
|
112 |
+
} else {
|
113 |
+
dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
|
114 |
+
}
|
115 |
+
}
|
116 |
+
}
|
117 |
+
|
118 |
+
return dp;
|
119 |
+
}
|
120 |
+
|
121 |
+
/**
|
122 |
+
* Build diff from LCS table
|
123 |
+
*/
|
124 |
+
function buildDiff(a, b, dp) {
|
125 |
+
const diff = [];
|
126 |
+
let i = a.length;
|
127 |
+
let j = b.length;
|
128 |
+
|
129 |
+
while (i > 0 || j > 0) {
|
130 |
+
if (i > 0 && j > 0 && a[i - 1] === b[j - 1]) {
|
131 |
+
// Elements are equal (characters, words, or lines, depending on the caller)
|
132 |
+
if (diff.length > 0 && diff[diff.length - 1].type === 'equal') {
|
133 |
+
diff[diff.length - 1].value.unshift(a[i - 1]);
|
134 |
+
} else {
|
135 |
+
diff.push({ type: 'equal', value: [a[i - 1]] });
|
136 |
+
}
|
137 |
+
i--;
|
138 |
+
j--;
|
139 |
+
} else if (j > 0 && (i === 0 || dp[i][j - 1] >= dp[i - 1][j])) {
|
140 |
+
// Element in b but not in a (insertion)
|
141 |
+
if (diff.length > 0 && diff[diff.length - 1].type === 'insert') {
|
142 |
+
diff[diff.length - 1].value.unshift(b[j - 1]);
|
143 |
+
} else {
|
144 |
+
diff.push({ type: 'insert', value: [b[j - 1]] });
|
145 |
+
}
|
146 |
+
j--;
|
147 |
+
} else {
|
148 |
+
// Element in a but not in b (deletion)
|
149 |
+
if (diff.length > 0 && diff[diff.length - 1].type === 'delete') {
|
150 |
+
diff[diff.length - 1].value.unshift(a[i - 1]);
|
151 |
+
} else {
|
152 |
+
diff.push({ type: 'delete', value: [a[i - 1]] });
|
153 |
+
}
|
154 |
+
i--;
|
155 |
+
}
|
156 |
+
}
|
157 |
+
|
158 |
+
diff.reverse();
|
159 |
+
|
160 |
+
// Convert arrays to strings for character diff
|
161 |
+
if (typeof a === 'string') {
|
162 |
+
diff.forEach(part => {
|
163 |
+
part.value = part.value.join('');
|
164 |
+
});
|
165 |
+
}
|
166 |
+
|
167 |
+
return diff;
|
168 |
+
}
|
169 |
+
|
170 |
+
/**
|
171 |
+
* Split text into words while preserving whitespace
|
172 |
+
*/
|
173 |
+
function splitIntoWords(text) {
|
174 |
+
const words = [];
|
175 |
+
let current = '';
|
176 |
+
let inWord = false;
|
177 |
+
|
178 |
+
for (const char of text) {
|
179 |
+
if (/\s/.test(char)) {
|
180 |
+
if (inWord && current) {
|
181 |
+
words.push(current);
|
182 |
+
current = '';
|
183 |
+
inWord = false;
|
184 |
+
}
|
185 |
+
words.push(char);
|
186 |
+
} else {
|
187 |
+
current += char;
|
188 |
+
inWord = true;
|
189 |
+
}
|
190 |
+
}
|
191 |
+
|
192 |
+
if (current) {
|
193 |
+
words.push(current);
|
194 |
+
}
|
195 |
+
|
196 |
+
return words;
|
197 |
+
}
|
198 |
+
|
199 |
+
/**
|
200 |
+
* Escape HTML special characters
|
201 |
+
*/
|
202 |
+
function escapeHtml(text) {
|
203 |
+
const div = document.createElement('div');
|
204 |
+
div.textContent = text;
|
205 |
+
return div.innerHTML;
|
206 |
+
}
|
207 |
+
|
208 |
+
/**
|
209 |
+
* Calculate similarity percentage between two texts
|
210 |
+
*/
|
211 |
+
function calculateSimilarity(original, improved) {
|
212 |
+
if (!original || !improved) return 0;
|
213 |
+
|
214 |
+
const dp = computeLCS(original, improved);
|
215 |
+
const lcsLength = dp[original.length][improved.length];
|
216 |
+
const maxLength = Math.max(original.length, improved.length);
|
217 |
+
|
218 |
+
return Math.round((lcsLength / maxLength) * 100);
|
219 |
+
}
|
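`calculateSimilarity` reads only the last cell of the LCS table, so the full `(m + 1) x (n + 1)` matrix is not strictly needed for it. A possible alternative, sketched here under that assumption, keeps two rolling rows instead. `lcsLength` and `similarityPercent` are hypothetical names that are not part of `js/diff-utils.js`; `buildDiff` still needs the full table for its backtracking step.

```javascript
// Sketch: LCS length with two rolling rows, O(n) memory instead of O(m * n).
// lcsLength and similarityPercent are hypothetical names.

function lcsLength(a, b) {
  let prev = new Array(b.length + 1).fill(0);
  let curr = new Array(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      curr[j] = a[i - 1] === b[j - 1]
        ? prev[j - 1] + 1
        : Math.max(prev[j], curr[j - 1]);
    }
    // Swap buffers: the old row is overwritten on the next pass.
    // Index 0 is never written, so it stays 0 in both rows.
    [prev, curr] = [curr, prev];
  }
  return prev[b.length];
}

// Same formula as calculateSimilarity(): LCS length over the longer input.
function similarityPercent(original, improved) {
  if (!original || !improved) return 0;
  const longest = Math.max(original.length, improved.length);
  return Math.round((lcsLength(original, improved) / longest) * 100);
}

console.log(lcsLength('kitten', 'sitting'));         // 4 ("ittn")
console.log(similarityPercent('kitten', 'sitting')); // 57
console.log(similarityPercent('abc', 'abc'));        // 100
```

For long OCR pages the character-level table can be large (a 10,000 by 10,000 comparison allocates on the order of 10^8 cells), which is where the rolling-row variant would matter most.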