Model Card: Intel/gpt-oss-20b-int4-g64-rtn-AutoRound
Model Details
- Model Name: Intel/gpt-oss-20b-int4-g64-rtn-AutoRound
- Developer: Intel, based on OpenAI's gpt-oss-20b
- Release Date: Not explicitly stated in available information
- Model Type: Mixed-precision INT4 causal language model with symmetric quantization
- Base Model: OpenAI/gpt-oss-20b
- Quantization: 4-bit integer (INT4) with group size 64, using Intel's AutoRound via Round-To-Nearest (RTN) without algorithm tuning
- License: Apache 2.0
- Model Size: Approximately 20 billion total parameters; the quantized expert weights average ~4.25 bits per parameter (see Training and Quantization Details)
- Tensor Types: I32, BF16, F16
- Non-Expert Layers: Fallback to 16-bit precision (BF16/F16)
This model is a quantized version of OpenAI's gpt-oss-20b, optimized for efficient inference on various hardware, including CPUs, Intel GPUs, and CUDA-enabled GPUs. It is designed for lower latency and specialized use cases, leveraging a Mixture-of-Experts (MoE) architecture with approximately 20 billion total parameters, of which about 3.6 billion are active per inference pass.
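The quantization settings travel with the checkpoint, so you can confirm them before downloading the full weights; a minimal sketch, assuming a recent transformers release with Hugging Face Hub access:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Intel/gpt-oss-20b-int4-g64-rtn-AutoRound")
# AutoRound exports record fields such as bits, group_size, and sym in the
# checkpoint's quantization_config; printing it shows the INT4/g64/symmetric settings.
print(config.quantization_config)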
Intended Use
- Primary Use Cases:
  - Local inference on consumer-grade hardware (e.g., desktops, laptops)
  - Specialized tasks requiring low-latency text generation
  - Research and experimentation in natural language processing
  - Agentic workflows with strong instruction following, tool use (e.g., web search, Python code execution), and reasoning capabilities
- Supported Tasks:
  - Text generation
  - Instruction following
  - Chain-of-thought reasoning
  - Structured outputs
- Intended Users:
  - Developers and researchers
  - Enterprises building AI applications
  - Hardware enthusiasts testing local inference performance
The model is suitable for scenarios requiring efficient deployment on resource-constrained devices, such as those with 16GB of memory. It supports a context window of up to 131,072 tokens, with a recommended minimum of 16,384 for reasoning tasks.
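For long-context work it helps to budget prompt tokens against that window before generating; a minimal sketch, assuming a standard transformers install (the prompt content is a placeholder):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Intel/gpt-oss-20b-int4-g64-rtn-AutoRound")
messages = [{"role": "user", "content": "Summarize this report: <long document here>"}]

# apply_chat_template with the default tokenize=True returns token ids,
# including the chat-format wrapper tokens around the message content.
ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
print(f"prompt uses {len(ids)} of the 131,072-token context window")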
How to Use
Inference with Transformers
from transformers import pipeline

model_id = "Intel/gpt-oss-20b-int4-g64-rtn-AutoRound"

# Build a text-generation pipeline; device_map="auto" places weights on the
# available accelerators, and torch_dtype="auto" preserves the checkpoint's
# mixed INT4/16-bit precision.
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Explain quantum mechanics clearly and concisely."}]
outputs = pipe(messages, max_new_tokens=512)

# The pipeline returns the full chat transcript; the last entry is the assistant's reply.
print(outputs[0]["generated_text"][-1])
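gpt-oss supports a configurable reasoning effort (low/medium/high), selected through the system message per OpenAI's usage notes for the base model. Below is a lower-level variant of the same flow that sets it explicitly; a sketch, assuming a transformers release with native gpt-oss support:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Intel/gpt-oss-20b-int4-g64-rtn-AutoRound"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    # The reasoning level is read from the system message (low/medium/high).
    {"role": "system", "content": "Reasoning: high"},
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
generated = model.generate(inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(generated[0][inputs.shape[-1]:], skip_special_tokens=True))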
Hardware Requirements
- Minimum: 16GB VRAM for local inference (tested below on a 24GB NVIDIA RTX 3090)
- Recommended: Single 80GB GPU (e.g., NVIDIA H100, AMD MI300X) for optimal performance
- Tested Platforms:
  - Windows 11: Up to 36,000-token context with 24GB VRAM (RTX 3090)
  - Linux: Up to 52,000-token context with 24GB VRAM (RTX 3090)
- Performance (RTX 3090; figures measured on the MXFP4 build of gpt-oss-20b, so treat them as indicative for this INT4 checkpoint):
  - Windows: ~24–36 tokens/second (t/s) generation at 2,000–36,000-token context
  - Linux: ~55–114 t/s generation at 2,000–50,000-token context
Linux setups typically offer better performance due to lower VRAM overhead.
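Throughput varies widely with platform, context length, and backend, so it is worth measuring on your own hardware. A rough sketch, reusing the pipe object from the usage example above (an upper-bound estimate: it assumes the full token budget is generated and includes prompt-processing time):

import time

prompt = [{"role": "user", "content": "Write a short essay on superconductivity."}]
budget = 256

start = time.perf_counter()
pipe(prompt, max_new_tokens=budget, do_sample=False)
elapsed = time.perf_counter() - start

# Generation may stop early at an end-of-turn token, so treat this as approximate.
print(f"~{budget / elapsed:.1f} tokens/s")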
Ethical Considerations and Limitations
- Limitations:
  - The model may produce factually incorrect outputs and should not be relied upon for factual accuracy without verification.
  - Potential for generating biased, lewd, or offensive content due to limitations in the pretrained model and fine-tuning datasets.
  - Quantization may slightly degrade performance compared to the full-precision model.
- Ethical Considerations:
  - Developers should perform safety testing before deployment to mitigate risks of harmful outputs.
  - Users should be informed of the model's limitations and potential biases.
  - The model's open-weight nature allows fine-tuning, which could be misused to bypass safety mechanisms.
Seek legal counsel before using the model for commercial purposes.
Training and Quantization Details
- Base Model: OpenAI/gpt-oss-20b, a Mixture-of-Experts model with 20 billion total parameters (~3.6 billion active per inference).
- Quantization Method: Intel's AutoRound in Round-To-Nearest (RTN) mode, i.e., without algorithm tuning, using group size 64 and symmetric quantization for INT4 precision (a sketch of the RTN scheme appears at the end of this section).
- Weight Precision:
  - MoE projection weights: INT4 with group size 64 in this checkpoint (the original gpt-oss-20b release stores them as MXFP4, ~4.25 bits per parameter)
  - Non-expert layers: kept in 16-bit precision (BF16/F16)
- Training Data: Not disclosed in available information.
- Quantization Benefits: Reduces the memory footprint, enabling deployment on systems with as little as 16GB of memory (see the back-of-envelope estimate below).
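As a rough check on that footprint claim (illustrative; it ignores the 16-bit non-expert layers and file metadata, which add to the total):

# Back-of-envelope weight-memory estimate for INT4 quantization with group size 64.
# Assumes ~20e9 total parameters and one 16-bit scale per 64-weight group.
params = 20e9
int4_bits = 4 + 16 / 64                 # 4-bit weight + amortized group scale = 4.25 bits
print(f"INT4 g64: ~{params * int4_bits / 8 / 1e9:.1f} GB")  # ~10.6 GB
print(f"BF16:     ~{params * 16 / 8 / 1e9:.1f} GB")         # ~40.0 GB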
Quantization was performed with Intel's AutoRound, which integrates with Intel's Neural Compressor. For more details, see Intel's AutoRound documentation.
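To make the method concrete, here is a minimal sketch of symmetric group-wise RTN quantization as a general technique. It is not Intel's implementation; AutoRound proper additionally tunes the rounding via signed gradient descent (see Citation), which the untuned RTN mode used for this checkpoint skips.

import torch

def rtn_int4_symmetric(weight: torch.Tensor, group_size: int = 64):
    # Split the tensor into groups of `group_size` weights, each sharing one scale.
    groups = weight.reshape(-1, group_size)
    # Symmetric scheme: the scale maps the group's max magnitude to INT4 level 7,
    # with the zero-point fixed at 0.
    scale = groups.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    # Round-to-nearest, then clamp to the signed 4-bit range [-8, 7].
    q = torch.clamp(torch.round(groups / scale), -8, 7).to(torch.int8)
    return q, scale

# Dequantization is just q * scale; check the reconstruction error on a toy tensor.
w = torch.randn(128, 64)
q, scale = rtn_int4_symmetric(w)
w_hat = (q.float() * scale).reshape_as(w)
print("max abs error:", (w_hat - w).abs().max().item())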
Evaluation
- Performance Metrics: The model has been tested for inference speed on consumer hardware (e.g., RTX 3090), showing competitive token generation rates (see Hardware Requirements).
- Safety Evaluations: Based on OpenAI’s evaluations of gpt-oss-20b, the model does not reach high-risk capability thresholds in Biological, Chemical, Cyber, or AI Self-Improvement categories, even with adversarial fine-tuning.
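No task-accuracy numbers are reported here. To gather your own, one common route is EleutherAI's lm-evaluation-harness; an illustrative sketch of its Python entry point (the task choice is arbitrary, and argument details may differ across harness versions):

# pip install lm-eval
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Intel/gpt-oss-20b-int4-g64-rtn-AutoRound,trust_remote_code=True",
    tasks=["hellaswag"],
)
print(results["results"])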
Citation
@article{cheng2023optimize,
  title={Optimize Weight Rounding via Signed Gradient Descent for the Quantization of {LLMs}},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}