|
--- |
|
license: mit |
|
datasets: |
|
- kenhktsui/code-natural-language-classification-dataset |
|
language: |
|
- en |
|
metrics: |
|
- f1 |
|
pipeline_tag: text-classification |
|
library_name: fasttext |
|
--- |
|
# code-natural-language-fasttext-classifier
|
|
|
[Dataset](https://huggingface.co/datasets/kenhktsui/code-natural-language-classification-dataset) |
|
|
|
This classifier classifies a text as Code or NaturalLanguage.

The model was trained on 3.24M records, a mix of code and natural language, and achieves a test F1 score of 0.97.

The classifier can be used for LLM pretraining data curation, routing a text into different pipelines (e.g. a code syntax check).

It is ultra fast ⚡, with a throughput of ~2000 docs/s on CPU.
|
|
|
|
|
## 🛠️Usage |
|
```python
from typing import List
import re

from huggingface_hub import hf_hub_download
import fasttext


# Use "model_quantized.bin" for the quantized version.
model = fasttext.load_model(
    hf_hub_download("kenhktsui/code-natural-language-fasttext-classifier", "model.bin")
)


def replace_newlines(text: str) -> str:
    # fastText expects single-line input, so collapse newlines into spaces.
    return re.sub("\n+", " ", text)


def predict(text_list: List[str]) -> List[dict]:
    text_list = [replace_newlines(text) for text in text_list]
    pred = model.predict(text_list)
    # Strip the "__label__" prefix fastText adds to each label.
    return [{"label": l[0].replace("__label__", ""), "score": s[0]}
            for l, s in zip(*pred)]


predict([
    """This is a lightning fast model, which can classify at throughtput of 2000 doc/s with CPU""",
    """import torch""",
    """Short text won't work""",
])
# [{'label': 'NaturalLanguage', 'score': 0.96747404},
#  {'label': 'Code', 'score': 1.00001},
#  {'label': 'Code', 'score': 1.000009}]
```
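For LLM pretraining data curation, the classifier can act as a router. Below is a minimal sketch building on the `predict()` function above; the sample documents and the 0.9 confidence threshold are illustrative assumptions, not tuned values:

```python
# A minimal routing sketch built on the predict() function above.
# The 0.9 confidence threshold and the sample docs are illustrative
# assumptions, not part of the model.
docs = [
    "def add(a, b):\n    return a + b",
    "The quick brown fox jumps over the lazy dog.",
]

code_docs, text_docs = [], []
for doc, pred in zip(docs, predict(docs)):
    if pred["label"] == "Code" and pred["score"] >= 0.9:
        code_docs.append(doc)   # e.g. route to a code syntax check pipeline
    else:
        text_docs.append(doc)   # e.g. keep in the natural language pipeline
```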
|
## 📊Evaluation |
|
Full version:
|
```
                 precision    recall  f1-score   support

           Code       0.97      1.00      0.98    581282
NaturalLanguage       1.00      0.92      0.95    228993

       accuracy                           0.98    810275
      macro avg       0.98      0.96      0.97    810275
   weighted avg       0.98      0.98      0.98    810275
```
|
|
|
Quantized version:
|
```
                 precision    recall  f1-score   support

           Code       0.95      1.00      0.97    581282
NaturalLanguage       1.00      0.86      0.93    228993

      micro avg       0.96      0.96      0.96    810275
      macro avg       0.97      0.93      0.95    810275
   weighted avg       0.96      0.96      0.96    810275
```
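Loading the quantized version mirrors the Usage snippet above; only the filename changes:

```python
# Same loading pattern as in the Usage section, pointing at the
# quantized checkpoint instead of the full one.
model = fasttext.load_model(
    hf_hub_download("kenhktsui/code-natural-language-fasttext-classifier", "model_quantized.bin")
)
```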
|
|
|
|
|
## 📝Definition of Labels
|
Code covers: |
|
```
{'Assembly', 'Batchfile', 'C', 'C#', 'C++', 'CMake', 'CSS', 'Dockerfile',
 'FORTRAN', 'GO', 'HTML', 'Haskell', 'Java', 'JavaScript', 'Julia', 'Lua',
 'Makefile', 'PHP', 'Perl', 'PowerShell', 'Python', 'Ruby', 'Rust', 'SQL',
 'Scala', 'Shell', 'TeX', 'TypeScript', 'Visual Basic'}
```
|
Markdown is excluded because it overlaps heavily with natural language.
|
|
|
## ⚠️Known Limitation |
|
The classifier does not handle short text well, which is perhaps unsurprising.

It tends to misclassify short natural language, of the kind often found in code comments, as Code.
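
One possible mitigation is to skip classification below a minimum length. A minimal sketch building on `predict()` above, where the 20-character cutoff and the `Unknown` fallback label are illustrative assumptions:

```python
# A minimal length gate. The 20-character threshold and the "Unknown"
# fallback are illustrative assumptions, not tuned values.
def predict_with_gate(text_list: List[str], min_chars: int = 20) -> List[dict]:
    preds = predict(text_list)
    return [
        pred if len(text) >= min_chars else {"label": "Unknown", "score": 0.0}
        for text, pred in zip(text_list, preds)
    ]
```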
|
|