|
--- |
|
license: mit |
|
datasets: |
|
- kenhktsui/code-natural-language-classification-dataset |
|
language: |
|
- en |
|
metrics: |
|
- f1 |
|
pipeline_tag: text-classification |
|
library_name: fasttext |
|
--- |
|
# code-natural-language-fasttext-classifier
|
|
|
[Dataset](https://huggingface.co/datasets/kenhktsui/code-natural-language-classification-dataset) |
|
|
|
This classifier classifies a text as Code or NaturalLanguage.

The model was trained on 3.24M records, a mix of code and natural language, and achieves a test F1 score of 0.97.

The classifier can be used for LLM pretraining data curation, routing a text into different pipelines (e.g. a code syntax check).

It is ultra fast ⚡, with a throughput of ~2000 docs/s on CPU.
|
|
|
|
|
## 🛠️Usage |
|
```python
from typing import List
import re

from huggingface_hub import hf_hub_download
import fasttext


# Use "model_quantized.bin" for the quantized version.
model = fasttext.load_model(
    hf_hub_download("kenhktsui/code-natural-language-fasttext-classifier", "model.bin")
)


def replace_newlines(text: str) -> str:
    # fastText expects single-line input, so collapse newlines into spaces.
    return re.sub("\n+", " ", text)


def predict(text_list: List[str]) -> List[dict]:
    text_list = [replace_newlines(text) for text in text_list]
    pred = model.predict(text_list)
    # Strip the "__label__" prefix fastText adds to each label.
    return [{"label": l[0].replace("__label__", ""), "score": s[0]}
            for l, s in zip(*pred)]


predict([
    """This is a lightning fast model, which can classify at throughtput of 2000 doc/s with CPU""",
    """import torch""",
    """Short text won't work""",
])
# [{'label': 'NaturalLanguage', 'score': 0.96747404},
#  {'label': 'Code', 'score': 1.00001},
#  {'label': 'Code', 'score': 1.000009}]
```
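For LLM pretraining data curation, the classifier can act as a router. Below is a minimal sketch building on the `predict()` function above; the sample documents and the 0.9 confidence threshold are illustrative assumptions, not tuned values:

```python
# A minimal routing sketch built on the predict() function above.
# The 0.9 confidence threshold and the sample docs are illustrative
# assumptions, not part of the model.
docs = [
    "def add(a, b):\n    return a + b",
    "The quick brown fox jumps over the lazy dog.",
]

code_docs, text_docs = [], []
for doc, pred in zip(docs, predict(docs)):
    if pred["label"] == "Code" and pred["score"] >= 0.9:
        code_docs.append(doc)   # e.g. route to a code syntax check pipeline
    else:
        text_docs.append(doc)   # e.g. keep in the natural language pipeline
```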
|
## 📊Evaluation |
|
Full version:
|
```
                 precision    recall  f1-score   support

           Code       0.97      1.00      0.98    581282
NaturalLanguage       1.00      0.92      0.95    228993

       accuracy                           0.98    810275
      macro avg       0.98      0.96      0.97    810275
   weighted avg       0.98      0.98      0.98    810275
```
|
|
|
Quantized version:
|
```
                 precision    recall  f1-score   support

           Code       0.95      1.00      0.97    581282
NaturalLanguage       1.00      0.86      0.93    228993

      micro avg       0.96      0.96      0.96    810275
      macro avg       0.97      0.93      0.95    810275
   weighted avg       0.96      0.96      0.96    810275
```
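Loading the quantized version mirrors the Usage snippet above; only the filename changes:

```python
# Same loading pattern as in the Usage section, pointing at the
# quantized checkpoint instead of the full one.
model = fasttext.load_model(
    hf_hub_download("kenhktsui/code-natural-language-fasttext-classifier", "model_quantized.bin")
)
```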
|
|
|
|
|
## 📝Definition of Labels
|
Code covers: |
|
```
{'Assembly', 'Batchfile', 'C', 'C#', 'C++', 'CMake', 'CSS', 'Dockerfile',
 'FORTRAN', 'GO', 'HTML', 'Haskell', 'Java', 'JavaScript', 'Julia', 'Lua',
 'Makefile', 'PHP', 'Perl', 'PowerShell', 'Python', 'Ruby', 'Rust', 'SQL',
 'Scala', 'Shell', 'TeX', 'TypeScript', 'Visual Basic'}
```
|
Markdown is excluded because it overlaps heavily with natural language.
|
|
|
## ⚠️Known Limitation |
|
The classifier does not handle short text well, which is perhaps unsurprising.

It tends to misclassify short natural language, of the kind often found in code comments, as Code.
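
One possible mitigation is to skip classification below a minimum length. A minimal sketch building on `predict()` above, where the 20-character cutoff and the `Unknown` fallback label are illustrative assumptions:

```python
# A minimal length gate. The 20-character threshold and the "Unknown"
# fallback are illustrative assumptions, not tuned values.
def predict_with_gate(text_list: List[str], min_chars: int = 20) -> List[dict]:
    preds = predict(text_list)
    return [
        pred if len(text) >= min_chars else {"label": "Unknown", "score": 0.0}
        for text, pred in zip(text_list, preds)
    ]
```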
|
|