Model Card for Qwen2.5-3B-Instruct-SLDS

Model Summary

This model is Qwen2.5-3B-Instruct fine-tuned on the Swiss Landmark Decisions Summarization (SLDS) dataset.
SLDS is a multilingual dataset of 20,000 Swiss Federal Supreme Court decisions (1954–2024), each paired with headnotes in German, French, and Italian, resulting in ~60,000 decision–headnote pairs.

The model is optimized for legal abstractive summarization and produces concise, legally structured headnotes.
It supports both monolingual and cross-lingual summarization.

This model was trained 2x faster with Unsloth and Hugging Face's TRL library.
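
A minimal inference sketch with transformers follows. The exact instruction phrasing used during fine-tuning is not documented on this card, so the prompt below is illustrative only.

```python
# Minimal inference sketch; the prompt wording is illustrative, not the
# documented training template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ipst/Qwen2.5-3B-Instruct-SLDS"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

decision_text = "..."  # full text of a Swiss Federal Supreme Court decision

messages = [{
    "role": "user",
    "content": f"Summarize the following decision as a headnote in German:\n\n{decision_text}",
}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```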


Intended Use

  • Primary Task: Judicial summarization (decision → headnote generation).
  • Languages: German (de), French (fr), Italian (it).
  • Scenarios (a prompt sketch for the first two follows this list):
    • Monolingual summarization: e.g., German decision → German headnote.
    • Cross-lingual summarization: e.g., German decision → French headnote.
    • Legal research support: assisting in retrieval and navigation of court decisions.
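
A hypothetical prompt builder for the monolingual and cross-lingual settings is sketched below; the actual instruction template used during SLDS fine-tuning is not published on this card.

```python
# Hypothetical prompt builder; the actual instruction template used for
# SLDS fine-tuning is not published on this card.
def headnote_prompt(decision_text: str, target_lang: str) -> str:
    lang_names = {"de": "German", "fr": "French", "it": "Italian"}
    return (
        f"Summarize the following decision as a headnote in "
        f"{lang_names[target_lang]}:\n\n{decision_text}"
    )

decision_text = "..."  # e.g., a German-language decision
prompt_mono = headnote_prompt(decision_text, "de")   # de decision -> de headnote
prompt_cross = headnote_prompt(decision_text, "fr")  # de decision -> fr headnote
```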

Not intended for:

  • Replacing human legal expertise.
  • Serving as an authoritative legal source.
  • Automated legal advice or decision-making.

Training Data

The model was fine-tuned on SLDS: ~60,000 decision–headnote pairs built from 20,000 Swiss Federal Supreme Court decisions (1954–2024), with headnotes in German, French, and Italian. German-language decisions dominate the corpus, while Italian is underrepresented (see Limitations).

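The exact training recipe is not published on this card; the sketch below shows a supervised fine-tuning setup of this kind with TRL's SFTTrainer (the card notes Unsloth was used to accelerate training). The dataset ID, column names, and hyperparameters are assumptions.

```python
# Illustrative SFT sketch with TRL; the dataset ID, column names, and
# hyperparameters are assumptions, not the published recipe.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("ipst/slds", split="train")  # dataset ID assumed

def to_messages(example):
    # Assumed columns: "decision" (full text) and "headnote" (target summary).
    return {"messages": [
        {"role": "user",
         "content": "Summarize the following decision as a headnote:\n\n"
                    + example["decision"]},
        {"role": "assistant", "content": example["headnote"]},
    ]}

dataset = dataset.map(to_messages, remove_columns=dataset.column_names)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",  # base model for this card
    train_dataset=dataset,
    args=SFTConfig(output_dir="qwen2.5-3b-slds"),
)
trainer.train()
```
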
Training Procedure

  • Base Models (this card covers the Qwen2.5 3B variant; the accompanying paper fine-tunes several):

    • Qwen2.5 family (0.5B–14B)
    • Llama 3.2 (3B)
    • Phi-3.5-mini
  • Fine-tuning Objective: Conditional generation (decision → headnote).

  • Evaluation Metrics (a computation sketch follows this list):

    • Automatic: ROUGE-1/2/L and BLEU (lexical overlap), plus BERTScore (embedding-based similarity).
    • Domain-specific: LLM-as-a-Judge framework (DeepSeek V3) assessing five rubrics: accuracy, completeness, clarity, legal citations, and considerations.
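
The automatic metrics can be reproduced in spirit with the Hugging Face evaluate library; the exact configuration from the paper (e.g., BERTScore model choice and baseline rescaling) is an assumption here.

```python
# Sketch of the automatic metrics via `evaluate`; the exact settings used
# in the paper (BERTScore model, rescaling) are assumptions.
import evaluate

predictions = ["Generated headnote ..."]
references = ["Gold headnote ..."]

rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))

bleu = evaluate.load("sacrebleu")
print(bleu.compute(predictions=predictions,
                   references=[[r] for r in references]))

bertscore = evaluate.load("bertscore")
print(bertscore.compute(predictions=predictions, references=references,
                        lang="de", rescale_with_baseline=True))
```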

Model Performance

On the SLDS test set (2023–2024):

| Model | Setting | BERTScore ↑ | BLEU ↑ | ROUGE-1 ↑ | ROUGE-2 ↑ | ROUGE-L ↑ | JUDGE ↑ |
|---|---|---|---|---|---|---|---|
| Phi-3.5-mini | fine-tuned | 11.24 ± 3.82 | 34.84 ± 0.41 | 31.20 ± 2.08 | 14.11 ± 1.27 | 20.96 ± 1.35 | 15.25 ± 2.32 |
| Llama 3.2 3B | fine-tuned | 15.20 ± 4.40 | 21.89 ± 0.42 | 31.89 ± 2.34 | 14.87 ± 1.61 | 22.49 ± 1.60 | 18.47 ± 2.99 |
| Qwen2.5 0.5B | fine-tuned | -1.37 ± 3.85 | 32.20 ± 0.35 | 23.87 ± 1.68 | 9.46 ± 0.94 | 17.37 ± 1.09 | 5.80 ± 1.26 |
| Qwen2.5 1.5B | fine-tuned | 19.81 ± 2.72 | 36.79 ± 0.34 | 33.03 ± 1.73 | 14.14 ± 1.08 | 22.67 ± 1.13 | 15.92 ± 2.27 |
| Qwen2.5 3B (this model) | fine-tuned | 23.23 ± 2.80 | 38.42 ± 0.34 | 35.18 ± 1.79 | 15.66 ± 1.23 | 24.10 ± 1.17 | 20.31 ± 2.66 |
| Qwen2.5 7B | fine-tuned | 29.59 ± 1.97 | 41.40 ± 0.34 | 39.24 ± 1.59 | 18.26 ± 1.25 | 26.44 ± 1.15 | 28.37 ± 3.07 |
| Qwen2.5 14B | fine-tuned | 32.48 ± 1.98 | 41.80 ± 0.37 | 40.04 ± 1.74 | 19.99 ± 1.41 | 28.00 ± 1.28 | 31.38 ± 3.19 |
| GPT-4o | one-shot | 30.44 ± 1.74 | 31.89 ± 0.25 | 42.12 ± 1.79 | 18.92 ± 1.22 | 25.92 ± 1.05 | 39.70 ± 2.66 |
| Claude 3.5 Sonnet | one-shot | 5.53 ± 2.00 | 21.88 ± 0.25 | 41.86 ± 1.64 | 19.23 ± 1.19 | 27.67 ± 1.20 | 41.25 ± 2.90 |
| DeepSeek-R1 | one-shot | 20.28 ± 1.45 | 22.37 ± 0.18 | 38.30 ± 1.82 | 15.97 ± 0.85 | 21.03 ± 0.84 | 42.28 ± 2.21 |
| o3-mini | one-shot | 14.18 ± 1.31 | 20.55 ± 0.17 | 34.77 ± 1.43 | 11.92 ± 0.69 | 18.21 ± 0.67 | 34.82 ± 2.41 |

  • Lexical metrics: the fine-tuned models lead on overlap-based scores (BLEU, ROUGE).
  • LLM-judge scores: larger proprietary and reasoning models lead on legal precision.

Limitations

  • Language imbalance: German decisions dominate, while Italian remains underrepresented.
  • Biases: Headnotes reflect judicial style and conventions, not neutral summaries.
  • Evaluation mismatch: ROUGE and BLEU may not fully capture legal accuracy.
  • Overfitting risk: Models may overfit to formulaic headnote structures.
  • Cross-lingual difficulty: Some models struggle with non-monolingual headnote generation.

Ethical Considerations

  • Sensitive information: All data is anonymized by the Swiss Federal Supreme Court before publication.
  • Legal risk: Generated headnotes must not be used as official legal advice.
  • Fair use: Ensure attribution when reusing outputs.

How to Cite

If you use this model, please cite the dataset paper:

@article{rolshoven2025slds,
      title={Unlocking Legal Knowledge: A Multilingual Dataset for Judicial Summarization in Switzerland}, 
      author={Luca Rolshoven and Vishvaksenan Rasiah and Srinanda Brügger Bose and Sarah Hostettler and Lara Burkhalter and Matthias Stürmer and Joel Niklaus},
      year={2025},
      eprint={2410.13456},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.13456}, 
}