Model produces gibberish
Hi, I'm trying to use this model, but I can't get it to produce coherent text.
For example, the following code:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "recursal/QRWKV6-7B-Base"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float32,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
eos = tokenizer.convert_tokens_to_ids("<|endoftext|>")

prompt = "Write a short paragraph about the Moon."

# without chat template
x = tokenizer(prompt, return_tensors="pt").to(model.device)
y = model.generate(**x, max_new_tokens=32, do_sample=False, eos_token_id=eos, pad_token_id=eos)
print(tokenizer.decode(y[0][x["input_ids"].shape[1]:], skip_special_tokens=True))

# with chat template
messages = [{"role": "user", "content": prompt}]
chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
x = tokenizer(chat, return_tensors="pt").to(model.device)
y = model.generate(**x, max_new_tokens=160, do_sample=False, eos_token_id=eos, pad_token_id=eos)
print(tokenizer.decode(y[0][x["input_ids"].shape[1]:], skip_special_tokens=True))
```
Produces:
```
's LI堞FontAwesomeIcon/GPL\views@student灏@student/GPL\views@student chù/GPL怏 John/
known-Identifier strugg约翰WARDS@studentWARDS McC ли@student@student@student@student@student@student
ormsg'iconGetEnumerator nodeSharper\views disappe Wis ли EntityState@student@student Mik@student@student@student@student@student@student@student@student@student@student@student@student@student@student@student@student@student@student
```
I tried different settings, but all of them produce gibberish:
- different `torch_dtype` values
- GPU vs CPU
- with / without chat template
- different decoding strategies
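For what it's worth, I also know a tokenizer round-trip check can rule out a tokenizer/model mismatch, though I'd expect the tokenizer to be fine here. A minimal sketch:

```python
# Sanity check: encode and decode the prompt; a faithful round-trip suggests
# the tokenizer itself is not the source of the gibberish.
ids = tokenizer(prompt)["input_ids"]
print(ids[:10])
print(tokenizer.decode(ids))  # should reproduce the prompt exactly
```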
Could you share a code template that produces non-gibberish text? Is there a known issue with this model? (Note: I'm not interested in the Instruct model, but in the Base model.)
Thanks in advance for your help!
Same here. The instruction models seem to work fine, but I also could not get the base model to output coherent completions.
Yes, I can confirm that `recursal/QRWKV7-7B-Instruct` works. However, `recursal/QRWKV6-7B-Instruct` (note the 6 instead of the 7) also produces gibberish.
There were apparently two issues, related to updates to the FLA and Transformers libraries. I've fixed them in both repos, so please let me know if this works for you now!
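One caveat: `trust_remote_code` modules are cached locally, so if you still see the old behavior, forcing a re-download should pick up the fix. A minimal sketch, assuming the default Hugging Face cache location:

```python
from transformers import AutoModelForCausalLM

# Force a fresh fetch of the repo files; remote-code modules are cached under
# ~/.cache/huggingface/modules and can otherwise keep serving the stale version.
model = AutoModelForCausalLM.from_pretrained(
    "recursal/QRWKV6-7B-Base",
    trust_remote_code=True,
    force_download=True,
)
```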
During further investigation, I uncovered additional issues. Specifically, when evaluating the model with `lm_eval`, non-generation tasks run correctly, but generation tasks fail. Moreover, my attempted fix results in 0.0 scores, with outputs that are just repetitive loops.
### Non-generation tasks work fine
For example, `piqa` runs without issues and produces reasonable scores:
```bash
lm_eval --model hf \
    --model_args pretrained=recursal/QRWKV6-7B-Base \
    --trust_remote_code \
    --tasks piqa \
    --device cuda:0 \
    --batch_size 8 \
    --limit 5
```
Output:
| Tasks | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|-------|---------|--------|--------|--------|---|-------|---|--------|
| piqa  | 1       | none   | 0      | acc      | ↑ | 0.8 | ± | 0.2000 |
|       |         | none   | 0      | acc_norm | ↑ | 0.6 | ± | 0.2449 |
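For completeness, the same run can be reproduced from Python via the harness's `simple_evaluate` entry point. A sketch; argument names follow the current lm-evaluation-harness API and may vary across versions:

```python
import lm_eval

# Equivalent to the CLI invocation above; results["results"] holds the metrics.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=recursal/QRWKV6-7B-Base,trust_remote_code=True",
    tasks=["piqa"],
    device="cuda:0",
    batch_size=8,
    limit=5,
)
print(results["results"])
```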
### Generation tasks fail
Running a generation task like `gsm8k` (the only change to the `lm_eval` command above is `--tasks gsm8k` instead of `--tasks piqa`) triggers an error:
File "/nfs-gpu/users_home/davidstap/.cache_hf/modules/transformers_modules/recursal/QRWKV6-7B-Base/62932c0601c8bf45f6249f75f495baac87909981/modeling_rwkv6qwen2.py", line 446, in forward
attn_output = attn_output * g
~~~~~~~~~~~~^~~
RuntimeError: The size of tensor a (17920) must match the size of tensor b (3584) at non-singleton dimension 2
Notably, 17920 = 5 × 3584, matching the `--limit 5` batch, which suggests a batch dimension is being folded into the hidden dimension. I suspected this was related to incorrect handling of left-padding masks, so I implemented a small patch. However, with the patch applied, the model scores 0.0 and generates repetitive nonsense.
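For context, `lm_eval` left-pads prompts when batching generation requests. A minimal stand-in for that code path (my own sketch, not the harness's actual code) looks like this:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "recursal/QRWKV6-7B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, device_map="auto"
)

# lm_eval pads prompts on the left so all sequences end at the same position.
tokenizer.padding_side = "left"
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids("<|endoftext|>")

prompts = ["Question: What is 2 + 2?\nAnswer:", "Question: Name a prime number.\nAnswer:"]
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
out = model.generate(**batch, max_new_tokens=16, do_sample=False)
print(tokenizer.batch_decode(out[:, batch["input_ids"].shape[1]:], skip_special_tokens=True))
```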
### Example of faulty generation
Prompt (from `gsm8k`):
```
Question: Samantha bought a crate of 30 eggs for $5. If she decides to sell each egg for 20 cents, how many eggs will she have left by the time she recovers her capital from the sales?
Answer: There are 100 cents in each $1 so $5 gives 5*100 cents = <<5*100=500>>500 cents
To recover her capital of 500 cents from a selling price of 20 cents per egg she has to sell 500/20 = <<500/20=25>>25 eggs
There were 30 eggs in the crate to start with so she will have 30-25 = <<30-25=5>>5 eggs left
#### 5

Question: The teacher agrees to order pizza for the class. For every student in the class, she will buy 2 pieces of cheese and 1 piece of onion and they will eat exactly that amount. A large pizza has 18 slices. She orders 6 total pizzas and there are 8 pieces of cheese and 4 pieces of onion leftover. How many students are in the class?
Answer: Cheese pizzas are 2/3 of the pizza's purchased because 2 / (2+1) = 2/3
She buys 4 cheese pizzas because 6 x (2/3) = <<6*2/3=4>>4
These give her 72 pieces of cheese pizza because 4 x 18 = <<4*18=72>>72
The students at 64 pieces of cheese because 72 - 8 = <<72-8=64>>64
There are 32 students in her class because 64 / 2 = <<64/2=32>>32
#### 32

Question: Sandra wants to buy some sweets. She saved $10 for this purpose. Her mother gave her an additional $4, and her father twice as much as her mother. One candy costs $0.5, and one jelly bean $0.2. She wants to buy 14 candies and 20 jelly beans. How much money will she be left with after the purchase?
Answer: Sandra's father gave her $4 * 2 = $<<4*2=8>>8.
So Sandra has in total $8 + $4 + $10 = $<<8+4+10=22>>22.
She wants 14 candies, so she is going to pay 14 candies * $0.50/candy = $<<14*0.5=7>>7 for them.
She wants also 20 jellybeans, and they're going to cost 20 jellybeans * $0.20/jellybean = $<<20*0.2=4>>4.
So after the purchase, she will be left with $22 - $4 - $7 = $<<22-4-7=11>>11.
#### 11

Question: Faith's neighborhood, with a total of 20 homes, decided to install solar panels. Each home needed 10 panels capable of providing their power needs. The supplier of the panels brought 50 panels less than the required amount. The neighbors agreed to only install the panels up to where they'd be finished. How many homes had their panels installed?
Answer: The total number of panels required is 20*10 = <<20*10=200>>200 panels.
When 50 failed to be delivered, the total number available for use became 200-50 = <<200-50=150>>150 panels.
If each home requires 10 panels, the number of homes that had panels installed is 150/10 = <<150/10=15>>15 homes
#### 15

Question: Jenna wants to buy a concert ticket that costs $181, plus five drink tickets for $7 each. If Jenna earns $18 an hour and works 30 hours a week, what percentage of her monthly salary will she spend on this outing?
Answer: First find the total cost of the drink tickets: 5 tickets * $7/ticket = $<<5*7=35>>35
Then add that cost to the cost of the ticket to find the total cost: $35 + $181 = $<<35+181=216>>216
Then multiply Jenna's hourly rate by the number of hours she works each week to find her weekly earnings: $18/hour * 30 hours/week = $<<18*30=540>>540/week
Then multiply her weekly earnings by the number of weeks she works each month: $540/week * 4 weeks/month = $<<540*4=2160>>2160/month
Then divide the cost of the concert by Jenna's monthly earnings and multiply by 100% to express the answer as a percentage: $216 / $2160 * 100% = 10%
#### 10

Question: Billy sells DVDs. He has 8 customers on Tuesday. His first 3 customers buy one DVD each. His next 2 customers buy 2 DVDs each. His last 3 customers don't buy any DVDs. How many DVDs did Billy sell on Tuesday?
Answer:
```
Generated output:
```
Billy and the same time to the same time to the same time to the same time ...
```
(repeated endlessly)
This clearly indicates something is fundamentally broken in the generation pipeline.
### Confirming the issue outside `lm_eval`
This behavior is not tied to `lm_eval`. Running the same gsm8k prompt with `transformers`' `generate()` produces the same looping output (truncated here to 256 tokens):
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "recursal/QRWKV6-7B-Base"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float32,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
eos = tokenizer.convert_tokens_to_ids("<|endoftext|>")

# `prompt` holds the gsm8k few-shot prompt shown above
x = tokenizer(prompt, return_tensors="pt").to(model.device)
y = model.generate(**x, max_new_tokens=256, do_sample=False, eos_token_id=eos, pad_token_id=eos)
print(tokenizer.decode(y[0][x["input_ids"].shape[1]:], skip_special_tokens=True))
```
Output:
```
Billy's the number of the number of the number of the number ...
```
(same issue as above)
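As an aside, blocking repeated n-grams only papers over the loop rather than fixing it, but it can be a useful diagnostic to see whether the model has any signal at all. A sketch, just adding `no_repeat_ngram_size` to the call above:

```python
# Diagnostic only: suppress repeated 3-grams to see what the model produces
# once the degenerate loop is blocked. This does not fix the underlying bug.
y = model.generate(
    **x,
    max_new_tokens=256,
    do_sample=False,
    no_repeat_ngram_size=3,
    eos_token_id=eos,
    pad_token_id=eos,
)
print(tokenizer.decode(y[0][x["input_ids"].shape[1]:], skip_special_tokens=True))
```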
### Comparison with QRWKV7
For comparison, the `QRWKV7-7B-Instruct` model does not suffer from this issue. Running `gsm8k` works correctly:
```bash
lm_eval --model hf \
    --model_args pretrained=recursal/QRWKV7-7B-Instruct \
    --trust_remote_code \
    --tasks gsm8k \
    --device cuda:0 \
    --batch_size 8 \
    --limit 5
```
Results:
| Tasks | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|-------|---------|--------|--------|--------|---|-------|---|--------|
| gsm8k | 3       | flexible-extract | 5 | exact_match | ↑ | 1.0 | ± | 0.0 |
|       |         | strict-match     | 5 | exact_match | ↑ | 0.8 | ± | 0.2 |
### Conclusion
Something is still fundamentally wrong with QRWKV6 models when running generation tasks. Non-generation evaluations are fine, but text generation degenerates into meaningless repetition, both inside and outside `lm_eval`. By contrast, QRWKV7 models work correctly.
Thanks for noticing that - there were indeed a few problems, mostly related to Transformers, but one was an argument-order switcheroo for the cache introduced by that last fix. Please let me know if this works for you.
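For anyone curious what such a bug looks like, here is a toy illustration (not the actual modeling code; the state names are made up): a cache update that takes its tensors positionally will silently mis-store state if a call site swaps the order.

```python
# Toy illustration (not the actual modeling code): positional cache arguments
# let a swapped call site silently corrupt the recurrent state.
class ToyCache:
    def __init__(self):
        self.conv_state = None
        self.recurrent_state = None

    def update(self, conv_state, recurrent_state):
        self.conv_state = conv_state
        self.recurrent_state = recurrent_state

cache = ToyCache()
conv, rec = "CONV", "REC"
cache.update(rec, conv)  # bug: arguments swapped, states land in the wrong slots
print(cache.conv_state)  # prints "REC" instead of "CONV"
```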