manueldeprada (HF Staff) committed
Commit b60e744 · verified · 1 Parent(s): 396a9d9

Upload folder using huggingface_hub

Files changed (1):
  1. README.md +21 -92

README.md CHANGED
@@ -11,35 +11,7 @@ tags:
 
 Enable diverse beam search with the `num_beams`, `num_beam_groups` and `diversity_penalty` parameters (the `num_beams` parameter should be divisible by `num_beam_groups`).
 
- ```py
- import torch
- from transformers import AutoModelForCausalLM, AutoTokenizer, infer_device
-
- device = infer_device()
-
- tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
- inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to(device)
-
- model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", dtype=torch.float16).to(device)
- # explicitly set to 100 because Llama2 generation length is 4096
- outputs = model.generate(**inputs, max_new_tokens=50, num_beams=6, num_beam_groups=3, diversity_penalty=1.0, do_sample=False)
- tokenizer.batch_decode(outputs, skip_special_tokens=True)
- 'Hugging Face is an open-source company 🤗\nWe are an open-source company. Our mission is to democratize AI and make it accessible to everyone. We believe that AI should be used for the benefit of humanity, not for the benefit of a'
- ```
-
-
-
- DoLa works by **contrasting the logits** from the final layer with those from earlier layers of the model,
- amplifying factual knowledge localized in specific layers and suppressing spurious information.
-
- This can be useful for:
-
- * **Short-answer tasks** (e.g., TruthfulQA) — using higher layers (`dola_layers="high"`)
- * **Long-answer reasoning tasks** (e.g., GSM8K, StrategyQA, FACTOR, VicunaQA) — using lower layers (`dola_layers="low"`)
-
- DoLa is **not recommended for smaller models** such as GPT-2, as the improvement may be negligible.
-
- This implementation matches the `DoLa` functionality present in `transformers<4.53.0`.
+ This implementation matches the `group_beam_search` functionality present in `transformers<4.56.0`.
 
 ---
 
@@ -57,23 +29,23 @@ This implementation matches the `DoLa` functionality present in `transformers<4.53.0`.
 
 ## Additional Arguments
 
- * **`dola_layers`** (*str* or *List[int]*, optional):
-   Which earlier layers to contrast with the final layer. Can be:
-
-   * `"low"`: lower half of layers (recommended for long answers)
-   * `"high"`: upper half of layers (recommended for short answers)
-   * List of integer indices (e.g., `[18, 20]`)
-
-   **Note:**
-
-   * Layer 0 is the word embedding; layer 1 is the first transformer block.
-   * If the model has tied word embeddings, layer 0 is skipped and counting starts at layer 2.
-   * Typical defaults:
-
-     | # Layers | `"low"` range       | `"high"` range        |
-     | -------- | ------------------- | --------------------- |
-     | > 40     | `range(0, 20, 2)`   | `range(N - 20, N, 2)` |
-     | ≤ 40     | `range(0, N//2, 2)` | `range(N//2, N, 2)`   |
+ * **`num_beams`** (*int*, optional, defaults to `1`):
+   Number of beams for beam search. If it is not greater than `num_beam_groups`, it is set to `num_beam_groups`.
+
+ * **`num_beam_groups`** (*int*, optional, defaults to `1`):
+   Number of groups to divide `num_beams` into for group beam search.
+
+ * **`diversity_penalty`** (*float*, optional, defaults to `0.0`):
+   Diversity penalty applied across beam groups: this value is subtracted from a beam's score whenever it selects a token that another group has already chosen at the same step. Only effective when `num_beam_groups > 1`.
+
+ * **`early_stopping`** (*bool* or *str*, optional, defaults to `False`):
+   Controls when beam search stops: `True` stops as soon as `num_beams` complete candidates are found per batch, `False` applies a heuristic and stops when better candidates are very unlikely, and `"never"` stops only when no better candidates can exist.
+
+ * **`max_length`** (*int*, optional, defaults to `20`):
+   The maximum length of the generated sequence (input prompt plus generated tokens).
+
+ * **`num_return_sequences`** (*int*, optional, defaults to `1`):
+   The number of sequences to return.
 
 * **`repetition_penalty`** (*float*, optional, defaults to `None`):
   Helps reduce repetition. A value of `1.2` is recommended.
@@ -89,61 +61,18 @@ This implementation matches the `DoLa` functionality present in `transformers<4.53.0`.
 
 ## Example usage
 
- ### Using higher layers (short-answer tasks)
-
- ```python
- # requires `transformers>=4.56.0`; previously this was part of the library
+
+ ```py
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer, infer_device
 
 device = infer_device()
 
 tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
- model = AutoModelForCausalLM.from_pretrained(
-     "Qwen/Qwen3-0.6B", torch_dtype=torch.float16
- ).to(device)
-
- inputs = tokenizer("What is the highest peak in the world?", return_tensors="pt").to(device)
-
- outputs = model.generate(
-     **inputs,
-     max_new_tokens=50,
-     do_sample=False,
-     custom_generate="transformers-community/dola",
-     trust_remote_code=True,
-     dola_layers="high"
- )
-
- print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
- ```
-
- ---
-
- ### Contrasting specific layers
-
- ```python
- import torch
- from transformers import AutoModelForCausalLM, AutoTokenizer, infer_device
-
- device = infer_device()
-
- tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
- model = AutoModelForCausalLM.from_pretrained(
-     "Qwen/Qwen3-0.6B", torch_dtype=torch.float16
- ).to(device)
-
- inputs = tokenizer("What is the highest peak in the world?", return_tensors="pt").to(device)
-
- outputs = model.generate(
-     **inputs,
-     max_new_tokens=50,
-     do_sample=False,
-     repetition_penalty=1.2,
-     custom_generate="transformers-community/dola",
-     trust_remote_code=True,
-     dola_layers=[18, 20]
- )
-
- # Only decode the newly generated tokens
- print(tokenizer.batch_decode(outputs[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True))
- ```
+ inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to(device)
+
+ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", dtype=torch.float16).to(device)
+ # cap the output length; otherwise generation could run up to the model's maximum length
+ outputs = model.generate(**inputs, max_new_tokens=50, num_beams=6, num_beam_groups=3, diversity_penalty=1.0, do_sample=False, custom_generate="transformers-community/group-beam-search", trust_remote_code=True)
+ tokenizer.batch_decode(outputs, skip_special_tokens=True)
+ ```
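
The `diversity_penalty` documented in the README added above follows the Hamming-diversity idea from diverse beam search: at each decoding step, a beam group's token scores are reduced for tokens that earlier groups have already selected. The snippet below is a simplified, standalone sketch of that scoring rule (it assumes only `torch` and is illustrative, not the implementation shipped in this repository):

```py
# Simplified sketch of the diversity penalty used in group (diverse) beam search.
# Not the actual implementation from transformers-community/group-beam-search.
import torch

def apply_diversity_penalty(scores, tokens_chosen_by_earlier_groups, diversity_penalty=1.0):
    """Return a copy of `scores` with the penalty subtracted from every token
    that an earlier beam group already selected at this decoding step."""
    penalized = scores.clone()
    for token_id in tokens_chosen_by_earlier_groups:
        penalized[token_id] -= diversity_penalty
    return penalized

# Toy vocabulary of 8 tokens with made-up scores for one beam.
scores = torch.tensor([0.1, 0.9, 0.8, 0.2, 0.0, 0.3, 0.4, 0.5])

# Suppose group 0 already committed to tokens 1 and 2 at this step.
penalized = apply_diversity_penalty(scores, tokens_chosen_by_earlier_groups=[1, 2])

print(scores.argmax().item())     # 1 -> without the penalty, this group would copy group 0
print(penalized.argmax().item())  # 7 -> the penalty pushes the group toward a different token
```

In a full implementation the penalty typically scales with how many earlier groups picked each token, but the effect is the same: later groups are nudged away from continuations that earlier groups already used.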
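
As a usage variation on the example added in this commit, the untested sketch below additionally requests several finished beams via `num_return_sequences` and decodes only the newly generated tokens; it assumes the same `Qwen/Qwen3-0.6B` checkpoint and `transformers-community/group-beam-search` `custom_generate` repository shown above:

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, infer_device

device = infer_device()

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to(device)

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", dtype=torch.float16).to(device)

# Return several completed beams so the diversity across groups is visible.
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=6,
    num_beam_groups=3,
    num_return_sequences=3,
    diversity_penalty=1.0,
    do_sample=False,
    custom_generate="transformers-community/group-beam-search",
    trust_remote_code=True,
)

# Decode only the newly generated tokens, dropping the prompt.
print(tokenizer.batch_decode(outputs[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```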