---
license: mit
datasets:
- kenhktsui/code-natural-language-classification-dataset
language:
- en
metrics:
- f1
pipeline_tag: text-classification
library_name: fasttext
---

# code-natural-language-classification-dataset

[Dataset](https://huggingface.co/datasets/kenhktsui/code-natural-language-classification-dataset)

This classifier labels a text as either Code or NaturalLanguage.
The model was trained on 3.24M records, a mix of code and natural language, and achieved a macro-averaged test F1 score of 0.97.
The classifier can be used for LLM pretraining data curation, for example to route text into different pipelines (e.g. a code syntax check).
It is ultra fast ⚡, with a throughput of ~2000 docs/s on CPU.

## 🛠️Usage

```python
from typing import List
import re
from huggingface_hub import hf_hub_download
import fasttext


# Use "model_quantized.bin" for the quantized version
model = fasttext.load_model(
    hf_hub_download("kenhktsui/code-natural-language-fasttext-classifier", "model.bin")
)


def replace_newlines(text: str) -> str:
    # fastText predicts one line at a time, so collapse newlines into spaces
    return re.sub("\n+", " ", text)


def predict(text_list: List[str]) -> List[dict]:
    text_list = [replace_newlines(text) for text in text_list]
    pred = model.predict(text_list)
    # Drop the "__label__" prefix (str.lstrip strips a character set, not a prefix)
    return [{"label": l[0].replace("__label__", "", 1), "score": s[0]}
            for l, s in zip(*pred)]


predict([
    """This is a lightning fast model, which can classify at throughtput of 2000 doc/s with CPU""",
    """import torch""",
    """Short text won't work"""
])
# [{'label': 'NaturalLanguage', 'score': 0.96747404},
#  {'label': 'Code', 'score': 1.00001},
#  {'label': 'Code', 'score': 1.000009}]
```

## 📊Evaluation

Full version
```
                 precision    recall  f1-score   support

           Code       0.97      1.00      0.98    581282
NaturalLanguage       1.00      0.92      0.95    228993

       accuracy                           0.98    810275
      macro avg       0.98      0.96      0.97    810275
   weighted avg       0.98      0.98      0.98    810275
```

Quantized version
```
                 precision    recall  f1-score   support

           Code       0.95      1.00      0.97    581282
NaturalLanguage       1.00      0.86      0.93    228993

      micro avg       0.96      0.96      0.96    810275
      macro avg       0.97      0.93      0.95    810275
   weighted avg       0.96      0.96      0.96    810275
```

## 📝Definition of Label

Code covers:
```
{'Assembly', 'Batchfile', 'C', 'C#', 'C++', 'CMake', 'CSS', 'Dockerfile',
 'FORTRAN', 'GO', 'HTML', 'Haskell', 'Java', 'JavaScript', 'Julia', 'Lua',
 'Makefile', 'PHP', 'Perl', 'PowerShell', 'Python', 'Ruby', 'Rust', 'SQL',
 'Scala', 'Shell', 'TeX', 'TypeScript', 'Visual Basic'}
```
Markdown is excluded because it overlaps heavily with natural language.

## ⚠️Known Limitation

The classifier does not handle short text well, which is perhaps not surprising.
It tends to classify short natural-language text as Code, since this is the kind of text you might also find in code comments.
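
One possible way to work around the short-text limitation is to skip classification for very short inputs instead of trusting the prediction. The following is a minimal sketch, assuming a `predict_fn` with the same shape as the `predict` function in the Usage section. The `MIN_CHARS` threshold of 50 and the `Unknown` placeholder label are hypothetical choices for illustration, not part of the model, and should be tuned on your own data.

```python
from typing import Callable, Dict, List

# Hypothetical threshold, not taken from the model card; tune it on your own data.
MIN_CHARS = 50


def predict_with_length_guard(
    text_list: List[str],
    predict_fn: Callable[[List[str]], List[Dict]],
    min_chars: int = MIN_CHARS,
) -> List[Dict]:
    """Classify only texts with at least `min_chars` non-padding characters.

    Shorter texts get the placeholder label "Unknown" (an assumption of this
    sketch, not a label the model emits) and a score of None.
    """
    # Indices of the texts that are long enough to be classified
    long_idx = [i for i, t in enumerate(text_list) if len(t.strip()) >= min_chars]
    long_preds = predict_fn([text_list[i] for i in long_idx]) if long_idx else []

    # Start with placeholders, then fill in real predictions at their positions
    results: List[Dict] = [{"label": "Unknown", "score": None} for _ in text_list]
    for i, pred in zip(long_idx, long_preds):
        results[i] = pred
    return results
```

With the `predict` function from the Usage section, this could be called as `predict_with_length_guard(texts, predict)`. Texts marked `Unknown` can then be handled separately, for example by a fallback rule or manual review.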