# Model Card

A lightweight binary classifier that detects whether a Turkish input string is entirely or partially code (`CODE`) or ordinary natural language (`NL`).

The model is designed as a guard-rail component in LLM pipelines: if a user prompt is classified as `CODE`, upstream orchestration can refuse to forward it to the LLM, apply rate limits, or route it to a different policy (see the guard-rail sketch below).
## How to Get Started with the Model
Use the code below to get started with the model.
```python
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="yeniguno/turkish-code-detector",
    tokenizer="yeniguno/turkish-code-detector",
)

prompt = "def faktoriyel(n):\n return 1 if n <= 1 else n * faktoriyel(n-1)"
result = clf(prompt)
print(f"Classification: {result}\n")
# Classification: [{'label': 'CODE', 'score': 0.999995231628418}]

# Turkish for "Do you know who the creator of Linux is?"
prompt = "Linux'un yaratıcısı kimdir, biliyor musun?"
result = clf(prompt)
print(f"Classification: {result}\n")
# Classification: [{'label': 'NL', 'score': 0.9998611211776733}]
```
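The guard-rail pattern from the introduction can be sketched on top of this pipeline. Everything past the classifier call is an illustrative assumption: the `CODE_THRESHOLD` value and the refusal behaviour are placeholders to adapt to your own orchestration, not part of the model.

```python
# Guard-rail sketch (illustrative; threshold and policy are assumptions)
CODE_THRESHOLD = 0.9  # assumed confidence cutoff; tune on your own traffic

def guard(prompt: str) -> str:
    """Refuse code-like prompts before they reach the LLM (hypothetical policy)."""
    verdict = clf(prompt)[0]  # e.g. {'label': 'CODE', 'score': 0.99}
    if verdict["label"] == "CODE" and verdict["score"] >= CODE_THRESHOLD:
        # Hypothetical policy: refuse outright. Rate-limiting or rerouting
        # to a stricter policy are equally valid choices here.
        return "Code-like prompts are not accepted at this endpoint."
    return prompt  # forward unchanged to the LLM call downstream
```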
## Intended Use & Limitations

| ✓ Recommended | ✗ Not a Good Fit |
|---|---|
| Prompt filtering in LLM stacks | Detecting specific programming languages |
| Pre-screening user inputs in chat | Judging code quality or style |
| Moderating public text fields | Detecting tiny inline code tokens in very long documents |
| Fast, low-latency inference (≈1 ms on GPU) | Multilingual detection outside Turkish |
The classifier was trained only on Turkish text plus polyglot code snippets:

- Text in unseen languages (e.g. Japanese) may be mis-labelled `NL`.
- Very short, ambiguous strings (e.g. `"int"`) can be mis-labelled `CODE`; a threshold guard such as the sketch below can mitigate this.
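One possible mitigation for the short-string failure mode, sketched under assumed values (`MIN_CHARS` and `HIGH_CONFIDENCE` are illustrative, not tuned):

```python
MIN_CHARS = 12          # assumed: below this length, inputs are too ambiguous
HIGH_CONFIDENCE = 0.99  # assumed: extra confidence demanded for short inputs

def classify_with_guard(text: str) -> str:
    """Downgrade low-confidence CODE verdicts on very short strings."""
    verdict = clf(text)[0]
    if verdict["label"] == "CODE" and len(text.strip()) < MIN_CHARS:
        # Tokens like "int" are easy false positives, so only keep the
        # CODE label when the model is nearly certain.
        return "CODE" if verdict["score"] >= HIGH_CONFIDENCE else "NL"
    return verdict["label"]
```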
## Training Data

| Split | Total | NL | CODE |
|---|---|---|---|
| Train | 316 732 | 251 518 | 65 214 |
| Dev | 39 591 | 31 439 | 8 152 |
| Test | 39 592 | 31 440 | 8 152 |
## Training Hyperparameters

| Setting | Value |
|---|---|
| Optimiser | AdamW |
| Effective batch | 32 (2 × 16, fp16) |
| LR scheduler | Linear decay, no warm-up |
| Max length | 256 tokens |
| Epochs | ≤ 10 (early stopping at 6 k steps ≈ 0.30 epoch) |
| Loss | Cross-entropy with reversed class weights (`weight_NL = 10.0`, `weight_CODE = 1.0`) |
| Label smoothing | 0.1 |
| Hardware | 1 × A100 40 GB (Google Colab) |
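The weighted loss in the table can be reproduced with a small `Trainer` subclass. The sketch below is an assumption about how training could be wired up, not the released training script; only the weight and smoothing values come from the table.

```python
import torch
from transformers import Trainer

class WeightedTrainer(Trainer):
    """Trainer with the class-weighted, label-smoothed loss from the table above."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        # Reversed class weights from the table: NL (id 0) = 10.0, CODE (id 1) = 1.0
        weights = torch.tensor([10.0, 1.0], device=outputs.logits.device)
        loss_fct = torch.nn.CrossEntropyLoss(weight=weights, label_smoothing=0.1)
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss
```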
## Evaluation

| Split | Acc | Prec | Recall | F1 |
|---|---|---|---|---|
| Train | 0.9960 | 0.9978 | 0.9827 | 0.9902 |
| Dev | 0.9957 | 0.9981 | 0.9807 | 0.9894 |
| Test | 0.9954 | 0.9968 | 0.9807 | 0.9887 |

All metrics computed with `id2label = {0: "NL", 1: "CODE"}`.
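For reference, the table can be recomputed from pipeline predictions roughly as follows. This is a sketch under two assumptions the card does not state: that `texts` / `gold` hold the test split, and that precision/recall/F1 treat `CODE` as the positive class.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Assumed: texts / gold hold the test split, labelled 0 = NL, 1 = CODE
preds = [1 if out["label"] == "CODE" else 0 for out in clf(texts)]

acc = accuracy_score(gold, preds)
prec, rec, f1, _ = precision_recall_fscore_support(
    gold, preds, average="binary", pos_label=1  # assumed: CODE as positive class
)
print(f"Acc {acc:.4f}  Prec {prec:.4f}  Recall {rec:.4f}  F1 {f1:.4f}")
```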