Mitchins committed on
Commit 92e35ee · verified · 1 Parent(s): da6fa0c

Upload folder using huggingface_hub

Files changed (5)
  1. README.md +178 -3
  2. config.json +14 -0
  3. model.onnx +3 -0
  4. model.safetensors +3 -0
  5. model_info.json +21 -0
README.md CHANGED
@@ -1,3 +1,178 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ language: multilingual
3
+ license: mit
4
+ library_name: pytorch
5
+ tags:
6
+ - text-classification
7
+ - language-detection
8
+ - byte-level
9
+ - multilingual
10
+ - english-detection
11
+ - cnn
12
+ pipeline_tag: text-classification
13
+ datasets:
14
+ - custom
15
+ metrics:
16
+ - accuracy
17
+ model-index:
18
+ - name: innit
19
+   results:
20
+   - task:
21
+       type: text-classification
22
+       name: English vs Non-English Detection
23
+     metrics:
24
+     - type: accuracy
25
+       value: 99.94
26
+       name: Validation Accuracy
27
+     - type: accuracy
28
+       value: 100.0
29
+       name: Challenge Set Accuracy
30
+ ---
31
+
32
+ # innit: Fast English vs Non-English Text Detection
33
+
34
+ A lightweight byte-level CNN for fast binary language detection (English vs Non-English).
35
+
36
+ ## Model Details
37
+
38
+ - **Model Type**: Byte-level Convolutional Neural Network
39
+ - **Task**: Binary text classification (English vs Non-English)
40
+ - **Architecture**: TinyByteCNN_EN with 6 convolutional blocks
41
+ - **Parameters**: 156,642
42
+ - **Input**: Raw UTF-8 bytes (max 256 bytes)
43
+ - **Output**: Binary classification (0=Non-English, 1=English)
44
+
45
+ ## Performance
46
+
47
+ - **Validation Accuracy**: 99.94%
48
+ - **Challenge Set Accuracy**: 100% (14/14 test cases)
49
+ - **Inference Speed**: Sub-millisecond on modern CPUs
50
+ - **Model Size**: ~630 KB (`model.safetensors` is 634,264 bytes; `model.onnx` is 643,861 bytes)
51
+
52
+ ## Supported Languages
53
+
54
+ Trained to distinguish English from 52+ languages across diverse scripts:
55
+ - **Latin scripts**: Spanish, French, German, Italian, Dutch, Portuguese, etc.
56
+ - **CJK scripts**: Chinese (Simplified/Traditional), Japanese, Korean
57
+ - **Cyrillic scripts**: Russian, Ukrainian, Bulgarian, Serbian
58
+ - **Other scripts**: Arabic, Hindi, Bengali, Thai, Hebrew, etc.
59
+
60
+ ## Architecture
61
+
62
+ ```
63
+ TinyByteCNN_EN:
64
+ ├── Embedding: 257 → 80 dimensions (256 bytes + padding)
65
+ ├── 6x Convolutional Blocks:
66
+ │   ├── Conv1D (kernel=3, residual connections)
67
+ │   ├── GELU activation
68
+ │   ├── BatchNorm1D
69
+ │   └── Dropout (0.15)
70
+ ├── Enhanced Pooling: mean + max + std
71
+ └── Classification Head: 240 → 80 → 2
72
+ ```
73
+
74
+ ## Training Data
75
+
76
+ - **Total samples**: 17,543 balanced samples
77
+ - **English**: 8,772 samples from diverse sources
78
+ - **Non-English**: 8,771 samples across 52+ languages
79
+ - **Text lengths**: 3-276 characters (optimized for short texts)
80
+ - **Special coverage**: Emoji handling, mathematical formulas, scientific notation
81
+
82
+ ## Quick Start
83
+
84
+ ### Option 1: ONNX Runtime (Recommended)
85
+ ```python
86
+ import onnxruntime as ort
87
+ import numpy as np
88
+
89
+ # Load ONNX model
90
+ session = ort.InferenceSession("model.onnx")
91
+
92
+ def predict(text):
93
+     # Prepare input: UTF-8 bytes, truncated to 256 and zero-padded
94
+     bytes_data = text.encode('utf-8', errors='ignore')[:256]
95
+     padded = np.zeros(256, dtype=np.int64)
96
+     padded[:len(bytes_data)] = list(bytes_data)
97
+
98
+     # Run inference
99
+     outputs = session.run(['logits'], {'input_bytes': padded.reshape(1, -1)})
100
+     logits = outputs[0][0]
101
+
102
+     # Apply a numerically stable softmax
103
+     exp_logits = np.exp(logits - np.max(logits))
104
+     probs = exp_logits / np.sum(exp_logits)
105
+     return probs[1]  # English probability
106
+
107
+ # Examples
108
+ print(predict("Hello world!")) # ~1.0 (English)
109
+ print(predict("Bonjour le monde")) # ~0.0 (French)
110
+ print(predict("Check our sale! 🎉")) # ~1.0 (English with emoji)
111
+ ```
112
+
113
+ ### Option 2: Python Package
114
+ ```bash
115
+ # Install the utility package
116
+ pip install innit-detector
117
+
118
+ # CLI usage
119
+ innit --download # Download the model first
120
+ innit "Hello world!" # → English (confidence: 0.974)
121
+ innit "Hello" "Bonjour" "你好" # Multiple texts
122
+
123
+ # Library usage (Python, not shell: run the lines below in a Python session)
124
+ from innit_detector import InnitDetector
125
+ detector = InnitDetector()
126
+ result = detector.predict("Hello world!")
127
+ print(result['is_english']) # True
128
+ ```
129
+
130
+ ### Option 3: PyTorch (Advanced)
131
+ ```python
132
+ import torch
133
+ import torch.nn.functional as F
134
+ from safetensors.torch import load_file
135
+ import numpy as np
136
+
137
+ # Load model (requires TinyByteCNN_EN class definition)
138
+ state_dict = load_file("model.safetensors")
139
+ model = TinyByteCNN_EN(emb=80, blocks=6, dropout=0.15)
140
+ model.load_state_dict(state_dict)
141
+ model.eval()
142
+
143
+ def predict(text):
144
+     bytes_data = text.encode('utf-8', errors='ignore')[:256]
145
+     padded = np.zeros(256, dtype=np.int64)  # int64 maps to torch.long; np.long is not available in every NumPy release
146
+     padded[:len(bytes_data)] = list(bytes_data)
147
+
148
+     with torch.no_grad():
149
+         logits = model(torch.tensor(padded).unsqueeze(0))
150
+         probs = F.softmax(logits, dim=1)
151
+         return probs[0][1].item()  # English probability
152
+ ```
153
+
154
+ ## ONNX Support
155
+
156
+ ONNX version available for cross-platform deployment:
157
+ - `model.onnx` - Full precision (FP32) for maximum compatibility
158
+
159
+ ## Challenge Set Results
160
+
161
+ 100% accuracy (14/14) on a small challenge set covering:
162
+ - Ultra-short texts: "Good morning!" ✅
163
+ - Emoji handling: "Check out our sale! 🎉" ✅
164
+ - Mathematical formulas: "x = (-b ± √(b²-4ac))/2a" ✅
165
+ - Scientific notation: "CO₂ + H₂O → C₆H₁₂O₆" ✅
166
+ - Diverse scripts: Arabic, CJK, Cyrillic, Devanagari ✅
167
+ - English-like languages: Dutch, German ✅
168
+
169
+ ## Limitations
170
+
171
+ - Binary classification only (English vs Non-English)
172
+ - Optimized for texts up to 256 UTF-8 bytes
173
+ - May have reduced accuracy on very rare languages not in training data
174
+ - Not suitable for mixed-language text (several languages in a single input)
175
+
176
+ ## License
177
+
178
+ MIT License - free for commercial use.
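The preprocessing shared by the README's ONNX and PyTorch snippets (UTF-8 encode, truncate to 256 bytes, zero-pad, then softmax over two logits) can be sketched without NumPy. This is a minimal illustration, not the package's implementation: `encode_bytes` and `softmax` are hypothetical helper names, the logits are made up, and the padding value of 0 simply mirrors the snippets in the README.

```python
import math

MAX_LEN = 256  # mirrors "max_length" in config.json


def encode_bytes(text, max_len=MAX_LEN):
    """Return a fixed-length list of byte values, zero-padded on the right."""
    raw = text.encode("utf-8", errors="ignore")[:max_len]
    return list(raw) + [0] * (max_len - len(raw))


def softmax(logits):
    """Numerically stable softmax over a short list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


ids = encode_bytes("Hello world!")
print(len(ids), ids[:5])

probs = softmax([-2.0, 3.0])  # made-up logits, index 1 = English
print(round(probs[1], 4))
```

Truncation here counts bytes, not characters, so a multi-byte character that straddles the 256-byte boundary is cut in the middle, which is also what the README snippets do.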
config.json ADDED
@@ -0,0 +1,14 @@
1
+ {
2
+ "architectures": [
3
+ "TinyByteCNN_EN"
4
+ ],
5
+ "model_type": "byte_cnn",
6
+ "emb_dim": 80,
7
+ "num_blocks": 6,
8
+ "dropout": 0.15,
9
+ "vocab_size": 257,
10
+ "num_classes": 2,
11
+ "max_length": 256,
12
+ "validation_accuracy": 99.94301994301995,
13
+ "torch_dtype": "float32"
14
+ }
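The values in this config line up with the README's architecture diagram: `vocab_size` 257 is 256 byte values plus one padding slot, and the head input width of 240 equals three pooled statistics times `emb_dim` 80. The sketch below checks that arithmetic on dummy activations. The model code is not part of this diff, so the pooling shown (a plain concatenation of per-channel mean, max and population standard deviation) is an assumption, and `pool` is a hypothetical name.

```python
import json
import statistics

# Subset of the config.json fields added in this commit
config = json.loads("""
{"emb_dim": 80, "num_blocks": 6, "dropout": 0.15,
 "vocab_size": 257, "num_classes": 2, "max_length": 256}
""")


def pool(features):
    """Concatenate per-channel mean, max and std over the sequence axis.

    `features` is a list of time steps, each a list of channel values.
    """
    channels = list(zip(*features))  # transpose to channel-major
    means = [statistics.fmean(c) for c in channels]
    maxes = [max(c) for c in channels]
    stds = [statistics.pstdev(c) for c in channels]  # assumption: population std
    return means + maxes + stds


# Dummy activations: max_length steps x emb_dim channels
dummy = [[float((t + c) % 7) for c in range(config["emb_dim"])]
         for t in range(config["max_length"])]

pooled = pool(dummy)
head_in = 3 * config["emb_dim"]
print(len(pooled), head_in)
```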
model.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:692e33fc0d94ab5ec9436c8b84853c4662e739b0a6f28110894c383a06f913ac
3
+ size 643861
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:dcc8aae0bf9626072b33569b6097c73763029e62eaae3f6b0d571fbb426a061c
3
+ size 634264
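Both model files are committed as Git LFS pointers, so the diff records only a SHA-256 digest and a byte size. A downloaded file can be checked against those two fields. The following is a minimal sketch with hypothetical helper names (`parse_lfs_pointer`, `matches_pointer`); it assumes the `oid` uses `sha256`, as both pointers here do.

```python
import hashlib
import os
import tempfile


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"version": fields["version"], "algo": algo,
            "digest": digest, "size": int(fields["size"])}


def matches_pointer(path, pointer):
    """Check a local file's size and SHA-256 digest against a parsed pointer."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        # Hash in 1 MiB chunks so large files are not read into memory at once
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == pointer["digest"]


onnx_pointer = parse_lfs_pointer("""\
version https://git-lfs.github.com/spec/v1
oid sha256:692e33fc0d94ab5ec9436c8b84853c4662e739b0a6f28110894c383a06f913ac
size 643861
""")
print(onnx_pointer["algo"], onnx_pointer["size"])
```

With the real file on disk, `matches_pointer("model.onnx", onnx_pointer)` would return `True` only if both the size and the digest agree.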
model_info.json ADDED
@@ -0,0 +1,21 @@
1
+ {
2
+ "model_name": "innit",
3
+ "version": "1.0",
4
+ "task": "english_detection",
5
+ "architecture": "TinyByteCNN_EN",
6
+ "parameters": 156642,
7
+ "input_format": "utf8_bytes",
8
+ "max_length": 256,
9
+ "output_classes": [
10
+ "NOT-EN",
11
+ "EN"
12
+ ],
13
+ "validation_accuracy": 99.94,
14
+ "challenge_accuracy": 100.0,
15
+ "files": {
16
+ "pytorch": "model.safetensors",
17
+ "config": "config.json",
18
+ "onnx": "model.onnx",
19
+ "readme": "README.md"
20
+ }
21
+ }
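The `output_classes` list gives the label for each logit index, which matches the README's `0=Non-English, 1=English` convention. A small sketch of turning raw logits into one of those labels follows; `label_for` is a hypothetical helper and the logits are made up.

```python
import json

# Only the field needed here, copied from model_info.json
model_info = json.loads('{"output_classes": ["NOT-EN", "EN"]}')


def label_for(logits, classes=model_info["output_classes"]):
    """Return the class name at the index of the largest logit."""
    best = max(range(len(logits)), key=logits.__getitem__)
    return classes[best]


print(label_for([0.3, 4.1]))   # EN
print(label_for([2.5, -1.0]))  # NOT-EN
```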