stack-edu-classifier-java

This is a classifier for scoring the educational value of code files in The Stack v2 dataset. It is a fine-tuned version of bigcode/starencoder with a classification head, trained on code files annotated by Llama3.1-70B-Instruct. We used this classifier to build the Stack-Edu dataset used for training SmolLM2 (see the SmolLM2 paper in the Citation section). Each classifier is trained on one programming language; this one targets Java.

How to use in transformers

To load the classifier, use the following code:

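from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Repository ID of this classifier on the Hugging Face Hub
REPO_NAME = "HuggingFaceTB/stack-edu-classifier-java"

tokenizer = AutoTokenizer.from_pretrained(REPO_NAME)
model = AutoModelForSequenceClassification.from_pretrained(REPO_NAME)

# Replace this placeholder with the contents of the code file to score
text = "This is a test sentence."
inputs = tokenizer(text, return_tensors="pt", padding="longest", truncation=True)
outputs = model(**inputs)
# The head has a single regression output, so drop the last dimension
logits = outputs.logits.squeeze(-1).float().detach().numpy()
score = logits.item()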
result = {
    "text": text,
    "score": score,
    "int_score": int(round(max(0, min(score, 5)))),
}

print(result)
# {'text': 'This is a test sentence.', 'score': 0.07964489609003067, 'int_score': 0}
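The `int_score` conversion above clamps the raw regression output to the 0 to 5 range and rounds it. A minimal sketch of that step as a reusable helper follows; `to_int_score` is a hypothetical name and the input values are made up for illustration, not model outputs.

```python
def to_int_score(score: float) -> int:
    """Clamp a raw regression score to [0, 5], then round to the nearest integer."""
    return int(round(max(0.0, min(score, 5.0))))


# Illustrative values only (not produced by the classifier)
raw_scores = [0.0796, 2.4, 3.6, 7.3, -1.2]
print([to_int_score(s) for s in raw_scores])  # [0, 2, 4, 5, 0]
```

Note that Python's built-in `round` rounds exact halves to the nearest even integer, so a score of 2.5 maps to 2 while 3.5 maps to 4.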

Intended uses & limitations

While the classifier performs well in distinguishing high-quality code in its target language (Java in this case), there are some limitations:

Scope: The model's performance might change for other datasets, in particular for out-of-distribution samples. The classifier's context is 1024 tokens, which might not be sufficient to assess the quality of some long code files.
Bias: The model's performance is dependent on the quality and representativeness of the training data and the LLM used for the annotation. Biases in both can affect the classifier's judgments. It might overfit to thoroughly commented code.
Context: The classifier evaluates individual code files without considering broader context, which might impact its effectiveness in certain scenarios.

The training and inference code is available on GitHub: https://github.com/huggingface/cosmopedia/tree/main/classification

Training procedure

The classifier was trained on 500,000 pairs of code files and their scores from 0 to 5, generated by Llama3.1-70B-Instruct. The samples were annotated based on their educational quality, with 1 being not educational and 5 being highly educational and relevant for teaching programming. You can find the prompt used for building the annotations in the appendix of the SmolLM2 paper.

We added a classification head with a single regression output to StarEncoder and trained the model for 20 epochs with a learning rate of 3e-4. During training, the embedding and encoder layers were frozen to focus on the classification head.

It achieves the following results on the evaluation set:

Loss: 0.3339
Precision: 0.7086
Recall: 0.4035
F1 Macro: 0.4297
Accuracy: 0.6370
F1 Binary Minimum3: 0.7202
F1 Binary Minimum2: 0.9386

Although the macro F1 score across the 1-5 rating scale is relatively low because the model has difficulty distinguishing between higher-rated samples, the classifier performs well for our primary filtering task. When the scores are converted to binary labels, a threshold of 2 yields F1 scores between 0.8 and 0.9 for most Stack-Edu classifiers, whereas a threshold of 3 yields F1 scores between 0.5 and 0.8. The highest scores are obtained for Python, SQL, C, and Rust, and the lowest for HTML, TypeScript, and C#.
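To make the binary metrics concrete, here is a minimal sketch of how integer scores can be turned into a binary label with a threshold and scored with F1. It assumes the positive class is "score >= threshold"; the function name `binary_f1` and the toy labels are hypothetical, and the released evaluation code may differ in detail.

```python
from typing import Sequence


def binary_f1(gold: Sequence[int], pred: Sequence[int], threshold: int) -> float:
    """F1 for the positive class 'score >= threshold' (assumed convention)."""
    gold_pos = [g >= threshold for g in gold]
    pred_pos = [p >= threshold for p in pred]
    tp = sum(g and p for g, p in zip(gold_pos, pred_pos))
    fp = sum((not g) and p for g, p in zip(gold_pos, pred_pos))
    fn = sum(g and (not p) for g, p in zip(gold_pos, pred_pos))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy annotations and predictions, invented for illustration
gold = [0, 1, 2, 3, 4, 5, 3, 2]
pred = [0, 2, 2, 3, 3, 4, 2, 3]
for threshold in (2, 3):
    print(threshold, round(binary_f1(gold, pred, threshold), 4))
```

On this toy data the lower threshold gives the higher F1, which is the same direction as the reported gap between the Minimum2 and Minimum3 metrics: separating clearly non-educational files is easier than separating the higher ratings from each other.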

The table below shows Stack-Edu dataset statistics and MultiPL-E scores for the four largest programming languages. We use HumanEval for Python evaluation. For the ablation, we started from a mid-training checkpoint of SmolLM2 at 3T tokens, which was trained primarily on web data, and performed linear annealing on 200B tokens, uniformly distributed across 15 of the most commonly used programming languages (~14B tokens each).

Language	Before filtering (B tokens)	After filtering (B tokens)	MultiPL-E (Original → Filtered)
Python	50.6	21.8	20.7 → 25.6
C++	69.7	16.0	16.7 → 24.8
JavaScript	45.3	11.1	18.2 → 22.4
Java	45.6	42.1	17.6 → 22.7
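The table can be read more easily as a retention rate and a score gain per language. The sketch below only recomputes those two quantities from the numbers in the table; the helper name `summarize` and the tuple layout are illustrative assumptions.

```python
# (language, tokens before filtering [B], tokens after filtering [B],
#  MultiPL-E original, MultiPL-E filtered), copied from the table above
ROWS = [
    ("Python", 50.6, 21.8, 20.7, 25.6),
    ("C++", 69.7, 16.0, 16.7, 24.8),
    ("JavaScript", 45.3, 11.1, 18.2, 22.4),
    ("Java", 45.6, 42.1, 17.6, 22.7),
]


def summarize(rows):
    """Map each language to (percent of tokens kept, MultiPL-E gain in points)."""
    summary = {}
    for lang, before, after, original, filtered in rows:
        kept_pct = round(100 * after / before, 1)
        gain = round(filtered - original, 1)
        summary[lang] = (kept_pct, gain)
    return summary


for lang, (kept_pct, gain) in summarize(ROWS).items():
    print(f"{lang}: kept {kept_pct}% of tokens, MultiPL-E +{gain}")
```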

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0003
train_batch_size: 64
eval_batch_size: 256
seed: 0
distributed_type: multi-GPU
num_devices: 2
total_train_batch_size: 128
total_eval_batch_size: 512
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 200
num_epochs: 20
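As a rough illustration of how these values fit together, the sketch below recomputes the total train batch size and evaluates a linear warmup followed by linear decay to zero. The total number of optimizer steps is not stated here, so `total_steps` is a hypothetical value, and `linear_lr` is an illustrative helper rather than the trainer's own implementation.

```python
LEARNING_RATE = 3e-4
WARMUP_STEPS = 200
TRAIN_BATCH_SIZE = 64  # per device
NUM_DEVICES = 2

# Effective batch size across devices: 64 * 2 = 128
total_train_batch_size = TRAIN_BATCH_SIZE * NUM_DEVICES


def linear_lr(step: int, total_steps: int,
              base_lr: float = LEARNING_RATE,
              warmup_steps: int = WARMUP_STEPS) -> float:
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 1_000  # hypothetical; the real value depends on dataset size and epochs
for step in (0, 100, 200, 600, 1_000):
    print(step, linear_lr(step, total_steps))
```

With these settings the learning rate peaks at 3e-4 exactly when warmup ends at step 200 and reaches zero at the final step.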

License

Apache 2.0

Citation

@misc{allal2025smollm2smolgoesbig,
      title={SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model}, 
      author={Loubna Ben Allal and Anton Lozhkov and Elie Bakouch and Gabriel Martín Blázquez and Guilherme Penedo and Lewis Tunstall and Andrés Marafioti and Hynek Kydlíček and Agustín Piqueres Lajarín and Vaibhav Srivastav and Joshua Lochner and Caleb Fahlgren and Xuan-Son Nguyen and Clémentine Fourrier and Ben Burtenshaw and Hugo Larcher and Haojun Zhao and Cyril Zakka and Mathieu Morlon and Colin Raffel and Leandro von Werra and Thomas Wolf},
      year={2025},
      eprint={2502.02737},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.02737}, 
}
