Introduction
This model is a prototype of a large language model specifically fine-tuned for Fortran90 code generation. It is based on the Qwen 2.5 Coder 3B Instruct model and has been refined using Supervised Fine-Tuning and Reinforcement Learning with Verifiable Rewards (via GRPO).
This model was fine-tuned briefly, without any human-labeled data and using only a single consumer GPU. Despite these clear constraints, the training process led to a 400% boost in performance on tasks involving simple to moderately complex Fortran program generation (HumanEval-like). Compilation errors dropped as well, and the model now performs close to much larger general-purpose models that were not specifically trained for this task.
Evaluation
Due to the lack of existing benchmarks for Fortran code generation, a quick-and-dirty adaptation of the HumanEval benchmark was created for Fortran in order to evaluate the model. This benchmark is currently under review and will be released publicly at a later date.
According to the current demo version of the FortranHumanEval benchmark:
| Model | pass@1 | Compile Error Rate |
|---|---|---|
| FortranCodeGen 3B | 23.17% | 17.68% |
| FortranCodeGen 3B (SFT only) | 19.51% | 31.09% |
| Qwen 2.5 Coder 3B Instruct | 5.48% | 63.41% |
| GPT-4o mini | 18.90% | 43.90% |
| GPT-4o | 32.31% | 17.07% |
Compared to its base model (Qwen 2.5 Coder 3B Instruct), FortranCodeGen 3B shows a strong improvement, increasing pass@1 accuracy from 5.48% to 23.17% and reducing the compile error rate from 63.41% to 17.68%. This highlights the effectiveness of this simple fine-tuning process, even though it was performed with limited resources: no human-labeled data, a small synthetic dataset, and training on a single consumer GPU (an L4 :'( ).
When compared to GPT-4o mini, FortranCodeGen 3B outperforms it in terms of both pass@1 accuracy (23.17% vs. 18.90%) and compile reliability (17.68% vs. 43.90%). This suggests that task-specific fine-tuning can produce better results than more general, (probably) larger models.
While it doesn't yet match the overall performance of GPT-4o, which achieves 32.31% pass@1, FortranCodeGen 3B reaches a comparable level of compilation correctness (17.68% vs. 17.07%), suggesting that its outputs are syntactically robust and close to executable, even when they don’t solve the full task.
These results confirm that targeted specialization can significantly enhance model performance on underrepresented tasks, and suggest a promising direction for very-low-resource fine-tuning in legacy or niche programming languages.
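For context, both metrics are straightforward to reproduce: a completion counts toward pass@1 only if it compiles and its stdout matches the expected output, and the compile error rate is the fraction of completions gfortran rejects. Below is a minimal sketch of such a harness, assuming gfortran is installed and a hypothetical item schema with `program`, `stdin`, and `expected_stdout` fields; the actual benchmark harness has not been released yet.

```python
import os
import subprocess
import tempfile

def run_fortran_program(source: str, stdin_text: str, timeout: int = 10):
    """Compile a Fortran source string with gfortran and run it on stdin_text.
    Returns (compiled, stdout)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "solution.f90")
        exe = os.path.join(tmp, "solution")
        with open(src, "w") as f:
            f.write(source)
        build = subprocess.run(["gfortran", src, "-o", exe], capture_output=True, text=True)
        if build.returncode != 0:
            return False, ""          # compilation error
        try:
            run = subprocess.run([exe], input=stdin_text, capture_output=True, text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return True, ""           # compiled, but the program hung
        return True, run.stdout

def evaluate(samples):
    """samples: list of dicts with 'program', 'stdin', 'expected_stdout' (hypothetical schema)."""
    compiled = passed = 0
    for s in samples:
        ok, out = run_fortran_program(s["program"], s["stdin"])
        compiled += ok
        passed += ok and out.strip() == s["expected_stdout"].strip()
    n = len(samples)
    print(f"pass@1: {passed / n:.2%}   compile error rate: {1 - compiled / n:.2%}")
```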
Uses
The model is highly specialized in generating Fortran90 programs that read input from stdin and print output to stdout. It’s recommended to use the model with a low temperature (or even disable sampling entirely) to maximize accuracy.
Before running any generated code, check how the program expects to receive its input from stdin; this is especially important for users who are new to Fortran.
Quick start
```python
from transformers import pipeline

# Prompt describing the desired Fortran program (reads from stdin, prints to stdout)
question = "Write me a Fortran program that, given an array of real numbers from stdin, prints the average."

# Load the model as a chat-style text-generation pipeline on GPU
generator = pipeline("text-generation", model="GiuLeo01/FortranCodeGen-3b-SynthData", device="cuda")

# Generate the Fortran source (only the newly generated text is returned)
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])
```
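As noted in the Uses section, accuracy is usually best with sampling disabled (or a very low temperature). A minimal variant of the call above with greedy decoding:

```python
# Greedy decoding: sampling disabled, as recommended for maximum accuracy
output = generator(
    [{"role": "user", "content": question}],
    max_new_tokens=128,
    do_sample=False,
    return_full_text=False,
)[0]
print(output["generated_text"])
```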
Training Details
Training Data
The goal of this experiment was to specialize a model in a complex task, like Fortran code generation, without using manually annotated data, which is particularly hard to find for this programming language.
Supervised Data
- A subset of the MBPP dataset (~600 examples) was selected.
- The task descriptions were automatically adapted to Fortran by OpenAI o3-mini, following precise instructions.
- Tasks were filtered using embeddings and manually reviewed to ensure that no examples too similar to HumanEval tasks were included in the training set.
- Each task was automatically labeled using three stronger (and bigger) models: OpenAI o3-mini, Qwen 2.5 Coder 32B, and OpenAI GPT-4o.
- Labels were automatically validated through unit tests.
- Only correct solutions were kept, at most one per task, prioritized in the following order: OpenAI o3-mini > Qwen 2.5 Coder 32B > OpenAI GPT-4o.
This simple process led to the creation of a small, synthetically labeled training set used for supervised fine-tuning.
IMPORTANT: Do not validate this model on MBPP-derived benchmarks due to the data overlap.
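The selection logic described above can be sketched as follows. This is an illustrative reconstruction, not the original script: `generate_solution` and `passes_unit_tests` are hypothetical helpers standing in for the model calls and the automatic unit-test validation.

```python
# Hypothetical sketch of the synthetic labeling step (not the released pipeline)
PRIORITY = ["openai-o3-mini", "qwen2.5-coder-32b", "gpt-4o"]  # order in which solutions are preferred

def label_task(task, generate_solution, passes_unit_tests):
    """Return at most one verified Fortran solution for a task, or None if all candidates fail."""
    for model_name in PRIORITY:
        candidate = generate_solution(model_name, task["description"])  # query a stronger model
        if passes_unit_tests(candidate, task["tests"]):                 # automatic validation
            return {"task": task["description"], "solution": candidate, "labeler": model_name}
    return None  # tasks without a verified solution are dropped
```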
Reinforcement Learning with Verifiable Rewards Data
In this phase, both the programming tasks and their test cases were generated automatically using a large language model (OpenAI o3-mini).
- The model received detailed instructions regarding:
  - the expected format of the task descriptions
  - the difficulty level of the problems
  - the structure and format of the test cases
To ensure a wide variety of tasks, 30 distinct themes were defined, including:
string manipulation and formatting, basic array processing (1D arrays), simple numeric sequences, frequency counting in arrays, finding prime numbers, basic sorting algorithms on 1D arrays, simple recursive functions, pattern detection in strings, calculating GCD and LCM, basic statistics (mean, median), string encoding/decoding, subarray sums, basic combinatorial calculations, bitwise operations, date and time manipulation, palindrome substring detection, basic hashing techniques, number base conversions, array rotation (1D), counting unique elements, string compression, validating numeric strings, string reversal with conditions, generating Fibonacci sequence, checking balanced parentheses, basic queue and stack problems (using 1D arrays), counting vowels and consonants, integer factorization, simple encryption/decryption, basic logical puzzles.
For each theme, the model was prompted once to generate 10 unique programming problems and their corresponding test cases.
This final step was key to generating high-quality synthetic data. Without a clearly defined theme, the model tends to repeat or default to similar types of tasks. By guiding generation through specific topics, I built a synthetic dataset of 300 examples, each composed of a task and a corresponding test case.
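A rough sketch of this theme-guided generation loop is shown below; the prompt wording and the OpenAI client call are illustrative assumptions, not the exact prompt or script that was used.

```python
# Hypothetical sketch of theme-guided task generation with OpenAI o3-mini
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = (
    "Generate 10 unique Fortran 90 programming problems on the theme: {theme}.\n"
    "Each problem must read its input from stdin and print its output to stdout, "
    "be of easy-to-moderate difficulty, and include one test case "
    "(an input string and the expected output)."
)

themes = [
    "string manipulation and formatting",
    "basic array processing (1D arrays)",
    # ... 30 themes in total, as listed above
]

raw_batches = []
for theme in themes:
    response = client.chat.completions.create(
        model="o3-mini",
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(theme=theme)}],
    )
    raw_batches.append(response.choices[0].message.content)  # parsed into task/test pairs downstream
```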
Training Procedure
Supervised Fine-Tuning
The annotated example dataset was split into training and validation sets (80/20 split), and used to perform full fine-tuning of the model.
Training was carried out for 10 epochs.
The key hyperparameters were:
- batch size = 4
- gradient accumulation steps = 4
- learning rate = 2e-5
- learning rate scheduler = cosine
- weight decay = 0.01
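For reference, the hyperparameters above map roughly onto a standard TRL `SFTTrainer` setup. The snippet below is a sketch under that assumption; the original training script is not published, and `train_ds` / `val_ds` are placeholders for the 80/20 split.

```python
from trl import SFTConfig, SFTTrainer

# Full fine-tuning with the hyperparameters listed above (sketch, not the original script)
config = SFTConfig(
    output_dir="fortrancodegen-3b-sft",
    num_train_epochs=10,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    weight_decay=0.01,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-Coder-3B-Instruct",
    args=config,
    train_dataset=train_ds,  # 80% of the synthetic SFT set (placeholder)
    eval_dataset=val_ds,     # remaining 20% (placeholder)
)
trainer.train()
```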
Reinforcement Learning with Verifiable Rewards
In this stage, a QLoRA adapter was trained using the GRPO algorithm to refine the generated Fortran programs. The goal was to reduce compilation errors and further improve the accuracy of the generated solutions.
The model was quantized to 4-bit, and a LoRA adapter was used with `rank=32` and `alpha=64`.
The reward function used throughout this phase was very simple:
- A reward of 1 was given if the generated program compiled successfully
- An additional 3 points were awarded if it passed the test case.

This results in a reward range of [0, 4].
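A minimal sketch of this reward, assuming a compile-and-run helper like `run_fortran_program` from the evaluation sketch above (the exact implementation used during training is not published):

```python
def reward(program_source: str, test_stdin: str, expected_stdout: str) -> float:
    """Verifiable reward in [0, 4]: +1 for compiling, +3 for passing the test case."""
    compiled, stdout = run_fortran_program(program_source, test_stdin)
    if not compiled:
        return 0.0                                   # compilation failed
    score = 1.0                                      # compiles successfully
    if stdout.strip() == expected_stdout.strip():
        score += 3.0                                 # passes the test case
    return score
```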
The initial training phase was run for 3 epochs with:
- batch size = 16
- number of generations = 4
- learning rate = 1e-5
- learning rate scheduler = cosine
- weight decay = 0.1
- max gradient norm = 0.5
A second phase followed, resetting the learning rate to 1e-6 with a linear decay schedule.
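The setup above corresponds roughly to a PEFT LoRA adapter trained with TRL's `GRPOTrainer`. The snippet below is a sketch under that assumption: argument names may differ from the original QLoRA-based setup referenced below, and the model path, dataset, and reward wrapper are placeholders.

```python
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# rank-32 / alpha-64 LoRA adapter; 4-bit quantization of the base model is omitted here for brevity
lora = LoraConfig(r=32, lora_alpha=64, task_type="CAUSAL_LM")

# First GRPO phase with the hyperparameters listed above (sketch, not the original script)
grpo_args = GRPOConfig(
    output_dir="fortrancodegen-3b-grpo",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    num_generations=4,
    learning_rate=1e-5,            # reset to 1e-6 with a linear schedule for the second phase
    lr_scheduler_type="cosine",
    weight_decay=0.1,
    max_grad_norm=0.5,
)

trainer = GRPOTrainer(
    model="path/to/sft-checkpoint",   # placeholder for the SFT-tuned model
    args=grpo_args,
    train_dataset=rl_tasks,           # 300 synthetic task/test pairs (placeholder)
    reward_funcs=[grpo_reward],       # wrapper around the compile-and-test reward above
    peft_config=lora,
)
trainer.train()
```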
Citation
If you use this model or parts of this work, please consider citing the references below.
References
- Qwen/Qwen2.5-Coder-3B-Instruct: https://huggingface.co/Qwen/Qwen2.5-Coder-3B-Instruct
- OpenAI o3-mini: https://platform.openai.com/docs/models
- OpenAI GPT-4o: https://platform.openai.com/docs/models
- Group Relative Policy Optimization (GRPO): https://arxiv.org/abs/2402.03300
- Unsloth (fast, memory-efficient fine-tuning via QLoRA): https://github.com/unslothai/unsloth
- Hugging Face Transformers: https://github.com/huggingface/transformers
Disclaimer on Use of Proprietary Models
Some of the training data used for this model was generated or labeled using proprietary large language models, including OpenAI o3-mini and GPT-4o. These models were used to synthesize programming tasks, adapt natural language descriptions, and automatically label code solutions for supervised fine-tuning and reinforcement learning. No raw outputs from these proprietary models are included in this repository or redistributed in any form. All generated data has been filtered, validated, and used solely to train a distinct, task-specific model.
This model is not intended to replicate or imitate any specific proprietary system, and is designed only for a specialized use case (program generation in Fortran) and for research purposes.