🏴󠁧󠁢󠁷󠁬󠁳󠁿 phi4-mini-welsh
Initial Welsh fine-tuned Phi-4-mini model, trained on authentic Welsh datasets.
This is a standalone merged model: the LoRA adapter has been merged into the base weights, so the Welsh language improvements are built in and the model can be used directly, with no separate adapter to load.
Model Details
- Base Model: microsoft/Phi-4-mini-instruct
- Training Method: LoRA (Low-Rank Adaptation) using Unsloth
- Language: Welsh (Cymraeg) with English support
- Model Type: Merged Causal Language Model
- Training Date: 2025-08-18 09:10:39 UTC
- Welsh Tokens: Extended tokenizer with Welsh-specific tokens (such as the `<welsh>` marker used in the Usage example; see the sketch after this list)
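The exact list of added tokens is not published in this card, so the snippet below is only a hypothetical sketch of how a tokenizer is typically extended; `<welsh>` mirrors the marker used in the Usage example, but the procedure and any other token strings are assumptions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical sketch: the actual Welsh token list for this model is not published.
base = "microsoft/Phi-4-mini-instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# "<welsh>" mirrors the marker in the Usage example below.
num_added = tokenizer.add_special_tokens({"additional_special_tokens": ["<welsh>"]})

# New token IDs need embedding rows, so resize the embedding matrix to match.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
```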
Training Configuration
- LoRA Rank: 64
- LoRA Alpha: 32
- Learning Rate: 0.0001
- Batch Size: 2
- Epochs: 3
- Max Sequence Length: 2048
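As a rough reconstruction, these values map onto Unsloth's `FastLanguageModel` API as in the sketch below; the `target_modules` choice and other unlisted arguments are assumptions, not the actual training script.

```python
from unsloth import FastLanguageModel
from transformers import TrainingArguments

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="microsoft/Phi-4-mini-instruct",
    max_seq_length=2048,            # Max Sequence Length
    load_in_4bit=True,              # 4-bit quantization (see Training Infrastructure)
)

model = FastLanguageModel.get_peft_model(
    model,
    r=64,                           # LoRA Rank
    lora_alpha=32,                  # LoRA Alpha
    # Assumed target modules -- a typical choice, not documented in this card.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",   # see Training Infrastructure
)

training_args = TrainingArguments(
    learning_rate=1e-4,             # Learning Rate
    per_device_train_batch_size=2,  # Batch Size
    num_train_epochs=3,             # Epochs
    output_dir="outputs",
)
```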
Training Data
This model was trained on the following Welsh language datasets:
- Banc Trawsgrifiadau Bangor (techiaith/banc-trawsgrifiadau-bangor)
- Common Voice Welsh 22.0 (techiaith/commonvoice_22_0_cy)
- Welsh Legislation (techiaith/legislation-gov-uk_en-cy)
- Welsh Wikipedia (1000 articles)
Total estimated tokens: 500K-800K authentic Welsh tokens
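The Techiaith corpora above are hosted on the Hugging Face Hub, so they can be pulled with the `datasets` library; the splits, configs, and preprocessing actually used for this model are not documented here, making this only a generic loading sketch.

```python
from datasets import load_dataset

# Dataset IDs from the list above; split and config choices are assumptions.
banc = load_dataset("techiaith/banc-trawsgrifiadau-bangor", split="train")
legislation = load_dataset("techiaith/legislation-gov-uk_en-cy", split="train")
```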
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model (standalone, with the Welsh tokens already merged in)
model = AutoModelForCausalLM.from_pretrained("DewiBrynJones/phi4-mini-welsh")
tokenizer = AutoTokenizer.from_pretrained("DewiBrynJones/phi4-mini-welsh")

# Generate Welsh text
prompt = "<welsh>Bore da, sut mae"
inputs = tokenizer(prompt, return_tensors="pt")
# max_new_tokens counts only generated tokens; max_length would also count the prompt
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
Model Capabilities
✅ What the model can do:
- Generate natural Welsh text
- Understand Welsh sentence structure and mutations
- Handle Welsh-English code-switching
- Recognize Welsh idioms and expressions
- Process both formal and colloquial Welsh
⚠️ Limitations:
- May still make grammatical errors
- Knowledge is limited to what appeared in the training data
- May occasionally mix languages inappropriately
Training Infrastructure
- Framework: Unsloth + PyTorch
- Hardware: NVIDIA GPU with CUDA support
- Optimization: 4-bit quantization for efficiency
- Memory: Gradient checkpointing enabled
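The 4-bit setting above describes training; for lower-memory inference with the merged model, a similar optional configuration via bitsandbytes would look like this sketch (not required for normal use).

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Optional: mirror the training-time 4-bit quantization at inference.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "DewiBrynJones/phi4-mini-welsh",
    quantization_config=quant_config,
    device_map="auto",   # requires the accelerate package
)
```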
Ethics and Bias
This model has been trained on publicly available Welsh language datasets. It may reflect biases present in the training data. Users should be aware of potential limitations when using the model for sensitive applications.
Acknowledgments
- Techiaith (Bangor University) for Welsh language datasets
- Unsloth for efficient training framework
- Microsoft for the base Phi-4-mini model
- Welsh language community for data contributions
License
This model is released under the MIT License. Please respect the original licenses of the training datasets.
Citation
If you use this model, please consider citing:
```bibtex
@misc{DewiBrynJones_phi4-mini-welsh,
  title={phi4-mini-welsh: Welsh Fine-tuned Phi-4-mini},
  author={DewiBrynJones},
  year={2025},
  url={https://huggingface.co/DewiBrynJones/phi4-mini-welsh}
}
```