manueldeprada (HF Staff) committed
Commit b60e744 · verified · 1 Parent(s): 396a9d9

Upload folder using huggingface_hub

Files changed (1):
  1. README.md +21 -92

README.md CHANGED
@@ -11,35 +11,7 @@ tags:
 
 Enable diverse beam search with the `num_beams`, `num_beam_groups` and `diversity_penalty` parameters (the `num_beams` parameter should be divisible by `num_beam_groups`).
 
- ```py
- import torch
- from transformers import AutoModelForCausalLM, AutoTokenizer, infer_device
-
- device = infer_device()
-
- tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
- inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to(device)
-
- model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", dtype=torch.float16).to(device)
- # explicitly set to 100 because Llama2 generation length is 4096
- outputs = model.generate(**inputs, max_new_tokens=50, num_beams=6, num_beam_groups=3, diversity_penalty=1.0, do_sample=False)
- tokenizer.batch_decode(outputs, skip_special_tokens=True)
- 'Hugging Face is an open-source company 🤗\nWe are an open-source company. Our mission is to democratize AI and make it accessible to everyone. We believe that AI should be used for the benefit of humanity, not for the benefit of a'
- ```
-
-
-
- DoLa works by **contrasting the logits** from the final layer with those from earlier layers of the model,
- amplifying factual knowledge localized in specific layers and suppressing spurious information.
-
- This can be useful for:
-
- * **Short-answer tasks** (e.g., TruthfulQA) — using higher layers (`dola_layers="high"`)
- * **Long-answer reasoning tasks** (e.g., GSM8K, StrategyQA, FACTOR, VicunaQA) — using lower layers (`dola_layers="low"`)
-
- DoLa is **not recommended for smaller models** such as GPT-2, as the improvement may be negligible.
-
- This implementation matches the `DoLa` functionality present in `transformers<4.53.0`.
+ This implementation matches the `group_beam_search` functionality present in `transformers<4.56.0`.
 
 ---
 
@@ -57,23 +29,23 @@ This implementation matches the `DoLa` functionality present in `transformers<4.53.0`.
 
 ## Additional Arguments
 
- * **`dola_layers`** (*str* or *List[int]*, optional):
-   Which earlier layers to contrast with the final layer. Can be:
-
-   * `"low"`: lower half of layers (recommended for long answers)
-   * `"high"`: upper half of layers (recommended for short answers)
-   * List of integer indices (e.g., `[18, 20]`)
-
-   **Note:**
-
-   * Layer 0 is the word embedding; layer 1 is the first transformer block.
-   * If the model has tied word embeddings, layer 0 is skipped and counting starts at layer 2.
-   * Typical defaults:
-
-     | # Layers | `"low"` range       | `"high"` range        |
-     | -------- | ------------------- | --------------------- |
-     | > 40     | `range(0, 20, 2)`   | `range(N - 20, N, 2)` |
-     | ≤ 40     | `range(0, N//2, 2)` | `range(N//2, N, 2)`   |
+ * **`num_beams`** (*int*, optional, defaults to `1`):
+   Number of beams for beam search. If it is not greater than `num_beam_groups`, it is set to `num_beam_groups`.
+
+ * **`num_beam_groups`** (*int*, optional, defaults to `1`):
+   Number of groups to divide `num_beams` into for group beam search.
+
+ * **`diversity_penalty`** (*float*, optional, defaults to `0.0`):
+   Diversity penalty applied across beam groups: this value is subtracted from a beam's score whenever it selects a token that another group has already chosen at the same step. Only effective when `num_beam_groups > 1`.
+
+ * **`early_stopping`** (*bool* or *str*, optional, defaults to `False`):
+   Controls when beam search stops: `True` stops as soon as `num_beams` complete candidates are found per batch, `False` applies a heuristic and stops when better candidates are very unlikely, and `"never"` stops only when no better candidates can exist.
+
+ * **`max_length`** (*int*, optional, defaults to `20`):
+   The maximum length of the generated sequence (input prompt plus generated tokens).
+
+ * **`num_return_sequences`** (*int*, optional, defaults to `1`):
+   The number of sequences to return.
 
 * **`repetition_penalty`** (*float*, optional, defaults to `None`):
   Helps reduce repetition. A value of `1.2` is recommended.
@@ -89,61 +61,18 @@ This implementation matches the `DoLa` functionality present in `transformers<4.53.0`.
 
 ## Example usage
 
- ### Using higher layers (short-answer tasks)
-
- ```python
- # requires `transformers>=4.56.0`; previously this was part of the library
+
+ ```py
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer, infer_device
 
 device = infer_device()
 
 tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
- model = AutoModelForCausalLM.from_pretrained(
-     "Qwen/Qwen3-0.6B", torch_dtype=torch.float16
- ).to(device)
-
- inputs = tokenizer("What is the highest peak in the world?", return_tensors="pt").to(device)
-
- outputs = model.generate(
-     **inputs,
-     max_new_tokens=50,
-     do_sample=False,
-     custom_generate="transformers-community/dola",
-     trust_remote_code=True,
-     dola_layers="high"
- )
-
- print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
- ```
-
- ---
-
- ### Contrasting specific layers
-
- ```python
- import torch
- from transformers import AutoModelForCausalLM, AutoTokenizer, infer_device
-
- device = infer_device()
-
- tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
- model = AutoModelForCausalLM.from_pretrained(
-     "Qwen/Qwen3-0.6B", torch_dtype=torch.float16
- ).to(device)
-
- inputs = tokenizer("What is the highest peak in the world?", return_tensors="pt").to(device)
-
- outputs = model.generate(
-     **inputs,
-     max_new_tokens=50,
-     do_sample=False,
-     repetition_penalty=1.2,
-     custom_generate="transformers-community/dola",
-     trust_remote_code=True,
-     dola_layers=[18, 20]
- )
-
- # Only decode the newly generated tokens
- print(tokenizer.batch_decode(outputs[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True))
- ```
+ inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to(device)
+
+ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", dtype=torch.float16).to(device)
+ # cap the output length; otherwise generation could run up to the model's maximum length
+ outputs = model.generate(**inputs, max_new_tokens=50, num_beams=6, num_beam_groups=3, diversity_penalty=1.0, do_sample=False, custom_generate="transformers-community/group-beam-search", trust_remote_code=True)
+ tokenizer.batch_decode(outputs, skip_special_tokens=True)
+ ```
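
The `diversity_penalty` documented in the README added above follows the Hamming-diversity idea from diverse beam search: at each decoding step, a beam group's token scores are reduced for tokens that earlier groups have already selected. The snippet below is a simplified, standalone sketch of that scoring rule (it assumes only `torch` and is illustrative, not the implementation shipped in this repository):

```py
# Simplified sketch of the diversity penalty used in group (diverse) beam search.
# Not the actual implementation from transformers-community/group-beam-search.
import torch

def apply_diversity_penalty(scores, tokens_chosen_by_earlier_groups, diversity_penalty=1.0):
    """Return a copy of `scores` with the penalty subtracted from every token
    that an earlier beam group already selected at this decoding step."""
    penalized = scores.clone()
    for token_id in tokens_chosen_by_earlier_groups:
        penalized[token_id] -= diversity_penalty
    return penalized

# Toy vocabulary of 8 tokens with made-up scores for one beam.
scores = torch.tensor([0.1, 0.9, 0.8, 0.2, 0.0, 0.3, 0.4, 0.5])

# Suppose group 0 already committed to tokens 1 and 2 at this step.
penalized = apply_diversity_penalty(scores, tokens_chosen_by_earlier_groups=[1, 2])

print(scores.argmax().item())     # 1 -> without the penalty, this group would copy group 0
print(penalized.argmax().item())  # 7 -> the penalty pushes the group toward a different token
```

In a full implementation the penalty typically scales with how many earlier groups picked each token, but the effect is the same: later groups are nudged away from continuations that earlier groups already used.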
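
As a usage variation on the example added in this commit, the untested sketch below additionally requests several finished beams via `num_return_sequences` and decodes only the newly generated tokens; it assumes the same `Qwen/Qwen3-0.6B` checkpoint and `transformers-community/group-beam-search` `custom_generate` repository shown above:

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, infer_device

device = infer_device()

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to(device)

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", dtype=torch.float16).to(device)

# Return several completed beams so the diversity across groups is visible.
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=6,
    num_beam_groups=3,
    num_return_sequences=3,
    diversity_penalty=1.0,
    do_sample=False,
    custom_generate="transformers-community/group-beam-search",
    trust_remote_code=True,
)

# Decode only the newly generated tokens, dropping the prompt.
print(tokenizer.batch_decode(outputs[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```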