# Qwen3-Reranker-4B-seq-cls-vllm-fixed
This is a fixed version of the Qwen3-Reranker-4B model converted to sequence classification format, optimized for use with vLLM.
## Model Description
This model is a pre-converted version of Qwen/Qwen3-Reranker-4B that:
- Has been converted from CausalLM to SequenceClassification architecture
- Includes proper configuration for vLLM compatibility
- Provides ~75,000x reduction in classification head size
- Offers ~150,000x fewer operations per token compared to using the full LM head
## Key Improvements
The original converted model (tomaarsen/Qwen3-Reranker-4B-seq-cls) was missing critical vLLM configuration attributes. This version adds:
```json
{
  "classifier_from_token": ["no", "yes"],
  "method": "from_2_way_softmax",
  "use_pad_token": false,
  "is_original_qwen3_reranker": false
}
```
These attributes are required for vLLM to load the pre-converted weights correctly.
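The `from_2_way_softmax` method works because a two-way softmax over the "no"/"yes" logits reduces exactly to a sigmoid of their difference, so collapsing the pair into the single logit `yes - no` loses no information. A quick numerical check (plain Python, no model required):

```python
import math

def softmax_yes(no_logit: float, yes_logit: float) -> float:
    """P(yes) from a softmax over the two logits [no, yes]."""
    m = max(no_logit, yes_logit)  # subtract max for numerical stability
    e_no = math.exp(no_logit - m)
    e_yes = math.exp(yes_logit - m)
    return e_yes / (e_no + e_yes)

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# The two formulations agree for any pair of logits.
for no_l, yes_l in [(-1.3, 2.7), (0.0, 0.0), (5.0, -4.2)]:
    assert abs(softmax_yes(no_l, yes_l) - sigmoid(yes_l - no_l)) < 1e-12
print("softmax over {no, yes} == sigmoid(yes - no)")
```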
## Usage with vLLM
```bash
vllm serve danielchalef/Qwen3-Reranker-4B-seq-cls-vllm-fixed \
  --task score \
  --served-model-name qwen3-reranker-4b \
  --disable-log-requests
```
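Once the server is running, scores can also be requested over HTTP. The sketch below only builds a request payload for vLLM's `/score` endpoint; the endpoint path, field names, and default port are assumptions based on vLLM's OpenAI-compatible server and should be checked against your vLLM version:

```python
import json

def build_score_request(query: str, document: str) -> dict:
    # Payload shape assumed from vLLM's /score API (text_1 scored against text_2).
    return {
        "model": "qwen3-reranker-4b",  # matches --served-model-name above
        "text_1": query,
        "text_2": document,
    }

payload = build_score_request(
    "What is the capital of France?",
    "Paris is the capital of France.",
)
print(json.dumps(payload))
# Send with e.g.: requests.post("http://localhost:8000/score", json=payload)
```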
### Python Example
```python
from vllm import LLM

llm = LLM(
    model="danielchalef/Qwen3-Reranker-4B-seq-cls-vllm-fixed",
    task="score",
)

queries = ["What is the capital of France?"]
documents = ["Paris is the capital of France."]

outputs = llm.score(queries, documents)
scores = [output.outputs.score for output in outputs]
print(scores)
```
## Performance
This model performs identically to the original Qwen3-Reranker-4B when used with proper configuration, while providing significant efficiency improvements:
- Memory: ~600MB → ~8KB for the classification head
- Compute: 151,936 logits → 1 logit per forward pass
- Speed: Faster inference due to reduced computation
## Technical Details
- Architecture: Qwen3ForSequenceClassification
- Base Model: Qwen/Qwen3-Reranker-4B
- Conversion Method: from_2_way_softmax (yes_logit - no_logit)
- Model Size: 4B parameters
- Task: Reranking/Scoring
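A minimal sketch of what the `from_2_way_softmax` conversion does to the weights, using random matrices and hypothetical token ids (the real hidden size and "no"/"yes" token ids differ): the LM head row for "yes" minus the row for "no" becomes the single classification-head weight vector, so one dot product replaces a full 151,936-way vocabulary projection.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 151936, 64          # real hidden size differs; 64 keeps this cheap
lm_head = rng.standard_normal((vocab, hidden)).astype(np.float32)
no_id, yes_id = 2305, 9693          # hypothetical token ids for "no" / "yes"

# The converted classification head: difference of the two LM-head rows.
cls_weight = lm_head[yes_id] - lm_head[no_id]        # shape (hidden,)

h = rng.standard_normal(hidden).astype(np.float32)   # a final hidden state
full = lm_head @ h                                   # 151,936 logits (original)
converted = cls_weight @ h                           # 1 logit (converted)

# Same relevance signal either way.
assert np.allclose(full[yes_id] - full[no_id], converted, atol=1e-4)
```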
## Citation
If you use this model, please cite the original Qwen3-Reranker:
```bibtex
@misc{qwen3reranker2024,
  title={Qwen3-Reranker},
  author={Qwen Team},
  year={2024},
  publisher={Hugging Face}
}
```
## License
Apache 2.0 (inherited from the base model)