h-j-han committed
Commit 2c8d07b · 1 Parent(s): 77625b7

Fix new line issue & Match vocab type to base model

README.md CHANGED
@@ -29,22 +29,24 @@ These are the merged version: after training the adapters, we merge the original
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
-# model_name = "mistralai/Mistral-7B-v0.1 # Base Model
+# model_name = "mistralai/Mistral-7B-v0.1" # Base Model
 model_name = "h-j-han/Mistral-7B-VocADT-50k-Mixed" # Vocabulary Adapted Model
 tokenizer = AutoTokenizer.from_pretrained(model_name)
-model = AutoModelForCausalLM.from_pretrained(model_name)
+model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
 
 prefix = "\nEnglish: Hello \nKorean: 안녕하세요 \nEnglish: Thank you\nKorean: 고맙습니다\nEnglish: "
-line = "I lived in Korea for seven years"
+line = "I'm a student."
 suffix = f"\nKorean:"
 prompt = prefix + line + suffix
 
 inputs = tokenizer(prompt, return_tensors="pt")
-outputs = model.generate(**inputs, max_new_tokens=8)
+for item in inputs:
+    inputs[item] = inputs[item].cuda()
+outputs = model.generate(**inputs, max_new_tokens=5)
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 
-# Base Model Output: "한국에 7년" # This short incomplete phrase in Korean is 8 tokens for the base model.
-# VocADT Output: "저는 한국에 7년 동안 살았습니다." # Complete and good output within 8 tokens
+# Base Model Output: "나는 학" # This short incomplete phrase in Korean is 5 tokens for the base model.
+# VocADT Output: "저는 학생입니다." # Complete and good output within 5 tokens
 ```
 
 ## Reference
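The per-key `.cuda()` loop added in this commit can also be written as a single dict comprehension (and with recent `transformers` versions, `tokenizer(...).to(model.device)` on the returned `BatchEncoding` does the same in one call). A dependency-free sketch of the pattern — `FakeTensor` below is a stand-in for `torch.Tensor` so this runs without a GPU or `torch` installed:

```python
# Device-placement sketch, equivalent to the per-key .cuda() loop in the diff.
# FakeTensor is a hypothetical stand-in for torch.Tensor (tracks only .device).
class FakeTensor:
    def __init__(self, device="cpu"):
        self.device = device

    def cuda(self):
        # Mimics torch.Tensor.cuda(): returns a copy placed on the GPU.
        return FakeTensor(device="cuda:0")

# A tokenizer call returns a mapping of names to tensors, roughly like this:
inputs = {"input_ids": FakeTensor(), "attention_mask": FakeTensor()}

# One-liner equivalent of the loop added in this commit:
inputs = {k: v.cuda() for k, v in inputs.items()}

print({k: v.device for k, v in inputs.items()})
# {'input_ids': 'cuda:0', 'attention_mask': 'cuda:0'}
```

Either form keeps every input tensor on the same device as the `device_map="auto"`-loaded model, which is what `model.generate(**inputs, ...)` requires.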
config.json CHANGED
@@ -21,5 +21,5 @@
   "torch_dtype": "bfloat16",
   "transformers_version": "4.43.0.dev0",
   "use_cache": true,
-  "vocab_size": 50298
+  "vocab_size": 50000
 }
model-00001-of-00003.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:b6c70409152bbd52d747ab55ffbf2c15298e11ef43e7bbd89e8271cc13fe7132
-size 4975618928
+oid sha256:b12b68690b89b00b5155b899e7af8c3ee1eeeb92c5c7715f7d001fd934b9f850
+size 4973177712
model-00003-of-00003.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:4e450598da48018d7d2fa90b378429e31fa378cba197cea9253e623598f3e8ee
-size 4891757352
+oid sha256:378dabdba3a3b4fcdb862f645af685249e5b02123d57c84bfd7f7ab4e23193dc
+size 4889316136
model.safetensors.index.json CHANGED
@@ -1,6 +1,6 @@
 {
   "metadata": {
-    "total_size": 14783258624
+    "total_size": 14778376192
   },
   "weight_map": {
     "lm_head.weight": "model-00003-of-00003.safetensors",
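The drop in `total_size` is consistent with trimming 298 rows (50298 − 50000) from both the input embedding and the LM head, stored in bfloat16. A quick sanity check, assuming Mistral-7B's hidden size of 4096 and untied embeddings (both values come from the base model's config, not from this diff):

```python
# Each removed vocab row costs hidden_size parameters in embed_tokens
# and again in lm_head; bfloat16 stores 2 bytes per parameter.
old_vocab, new_vocab = 50298, 50000
hidden_size = 4096        # Mistral-7B hidden dimension (assumption)
bytes_per_param = 2       # bfloat16
num_matrices = 2          # embed_tokens + lm_head (not tied in Mistral-7B)

delta_bytes = (old_vocab - new_vocab) * hidden_size * num_matrices * bytes_per_param
print(delta_bytes)                    # 4882432
print(14783258624 - 14778376192)      # 4882432 -> matches the total_size delta

# Each affected shard shrinks by exactly one matrix's worth of rows:
per_matrix = (old_vocab - new_vocab) * hidden_size * bytes_per_param
print(4975618928 - 4973177712)        # 2441216 -> shard 00001 delta
print(4891757352 - 4889316136)        # 2441216 -> shard 00003 delta
print(per_matrix)                     # 2441216
```

The two shard deltas above matching `per_matrix` suggests the embedding and the LM head live in shards 00001 and 00003 respectively, which is why shard 00002 is unchanged in this commit.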
tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff