Commit a3285ec by mlconvexai · verified · 1 parent: 56da898

Create README.md

  1. README.md +95 -0
README.md ADDED
@@ -0,0 +1,95 @@
+ ---
+ base_model:
+ - google/gemma-2-2b-it
+ tags:
+ - text-generation-inference
+ - transformers
+ - unsloth
+ - gemma2
+ - trl
+ license: gemma
+ language:
+ - en
+ - fi
+ - sv
+ ---
+
+ This example uses the text of the [European AI Act regulation](https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R1689) as training data in three languages:
+ English, Finnish, and Swedish. The dataset comprises 9,175 data points for training and 2,456 for evaluation.
+
+ Install the required Python libraries:
+
+ ```bash
+ pip install -U transformers
+ pip install torch torchvision torchaudio
+ pip install 'accelerate>=0.26.0'
+ ```
+
+ The training arguments used are as follows:
+
+ ```python
+ from transformers import TrainingArguments
+ from unsloth import is_bfloat16_supported
+
+ output_dir = "outputs"  # placeholder; the original path is not stated
+
+ training_args = TrainingArguments(
+     per_device_train_batch_size=32,
+     gradient_accumulation_steps=32,  # 32 x 32 = 1,024 sequences per optimizer step per device
+     warmup_steps=20,
+     max_steps=400,
+     learning_rate=1.5e-5,
+     fp16=not is_bfloat16_supported(),  # fall back to fp16 when bf16 is unavailable
+     bf16=is_bfloat16_supported(),
+     logging_steps=1,
+     optim="adamw_8bit",
+     weight_decay=0.01,
+     lr_scheduler_type="cosine",
+     seed=3407,
+     output_dir=output_dir,
+     report_to="none",
+     eval_strategy="steps",
+     eval_steps=10,  # evaluate every 10 optimizer steps
+     load_best_model_at_end=True,
+     metric_for_best_model="eval_loss",
+     greater_is_better=False,  # lower eval loss is better
+     save_total_limit=2,
+ )
+ ```
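
To put these settings in perspective, the sketch below works out the effective batch size, the approximate number of passes over the training set, and the learning rate at a few points in the run. It assumes a single GPU and that each of the 9,175 data points is one training sequence. The `cosine_lr` helper is a plain-Python approximation of linear warmup followed by cosine decay, not the scheduler code used in the original run, and all names are illustrative.

```python
import math

# Values copied from the training arguments above.
PER_DEVICE_BATCH = 32
GRAD_ACCUM_STEPS = 32
MAX_STEPS = 400
WARMUP_STEPS = 20
PEAK_LR = 1.5e-5

# Assumptions that are not stated in the README.
NUM_DEVICES = 1          # single GPU
TRAIN_EXAMPLES = 9_175   # one data point == one training sequence

# Sequences consumed per optimizer step and over the whole run.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES
examples_seen = effective_batch * MAX_STEPS
approx_epochs = examples_seen / TRAIN_EXAMPLES


def cosine_lr(step: int) -> float:
    """Approximate learning rate at `step`: linear warmup, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, MAX_STEPS - WARMUP_STEPS)
    return PEAK_LR * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


print(f"effective batch size: {effective_batch}")
print(f"approximate epochs:   {approx_epochs:.1f}")
for step in (0, 10, 20, 210, 400):
    print(f"step {step:>3}: lr = {cosine_lr(step):.2e}")
```

Under these assumptions, each optimizer step consumes 1,024 sequences, so 400 steps correspond to roughly 44.6 passes over the training set.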
+
+ Inference follows the standard Gemma chat workflow in `transformers`:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import torch
+
+ model_id = "mlconvexai/gemma-2-2b-it-finetuned-EU-Act-v2"
+ dtype = torch.bfloat16
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     device_map="auto",
+     torch_dtype=dtype,
+ )
+
+ # Finnish prompt: "What is the EU AI Act?"
+ chat = [
+     {"role": "user", "content": "Mikä on EU:n tekoälyasetus?"},
+ ]
+ # The chat template already adds the BOS token, hence add_special_tokens=False below.
+ prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
+ outputs = model.generate(
+     input_ids=inputs.to(model.device),
+     max_new_tokens=1024,
+     repetition_penalty=1.1,
+     no_repeat_ngram_size=4,
+ )
+ # Decodes the full sequence, prompt included.
+ print(tokenizer.decode(outputs[0]))
+ ```
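
For readers unfamiliar with chat templates, the following sketch approximates the prompt string that `apply_chat_template` produces for Gemma. It is an illustration based on the commonly documented Gemma turn format, not the tokenizer's own code. The authoritative version is the Jinja template shipped with the tokenizer, and `build_gemma_prompt` is a hypothetical helper.

```python
def build_gemma_prompt(messages, add_generation_prompt=True, bos_token="<bos>"):
    """Approximate the Gemma chat format (assumed, not read from the tokenizer)."""
    prompt = bos_token
    for message in messages:
        # Gemma labels assistant turns "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        content = message["content"].strip()
        prompt += f"<start_of_turn>{role}\n{content}<end_of_turn>\n"
    if add_generation_prompt:
        # Open a model turn so generation continues as the assistant.
        prompt += "<start_of_turn>model\n"
    return prompt


# Finnish prompt: "What is the EU AI Act?"
chat = [{"role": "user", "content": "Mikä on EU:n tekoälyasetus?"}]
prompt = build_gemma_prompt(chat)
print(repr(prompt))
```

Because the templated string already begins with the BOS token, encoding it with `add_special_tokens=False` avoids adding a second one.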
+
+ The fine-tuning process is described in more detail on [Medium](https://medium.com/@timo.au.laine/eu-ai-act-fine-tune-multilingual-local-llm-2c0657cc47f8).
+
+ # Uploaded model
+
+ - **Developed by:** mlconvexai
+ - **License:** Gemma
+ - **Fine-tuned from model:** google/gemma-2-2b-it
+
+ This Gemma 2 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.
+
+ [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)