
Model Card: Intel/gpt-oss-20b-int4-g64-rtn-AutoRound

Model Details

  • Model Name: Intel/gpt-oss-20b-int4-g64-rtn-AutoRound
  • Developer: Intel, based on OpenAI's gpt-oss-20b
  • Release Date: Not explicitly stated in available information
  • Model Type: Mixed INT4 language model with symmetric quantization
  • Base Model: OpenAI/gpt-oss-20b
  • Quantization: 4-bit integer (INT4) with group size 64, produced by Intel's AutoRound in Round-To-Nearest (RTN) mode, without algorithm tuning
  • License: Apache 2.0
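  • Model Size: Approximately 1.8 billion parameters as reported for the Safetensors files; this counts the stored tensors of the quantized checkpoint, not the roughly 20 billion logical parameters of the base model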
  • Tensor Types: I32, BF16, F16
  • Non-Expert Layers: Fallback to 16-bit precision (BF16/F16)

This model is a quantized version of OpenAI's gpt-oss-20b, optimized for efficient inference on various hardware, including CPUs, Intel GPUs, and CUDA-enabled GPUs. It is designed for lower latency and specialized use cases, leveraging a Mixture-of-Experts (MoE) architecture with approximately 20 billion total parameters, of which about 3.6 billion are active per inference pass.
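
The I32 tensor type listed above is consistent with INT4 weights being packed several to a 32-bit word, which would also explain why the stored tensor count is much smaller than the base model's parameter count. The sketch below shows one way eight signed 4-bit values can be packed into, and recovered from, a single 32-bit integer. The function names `pack_int4` and `unpack_int4` and the low-nibble-first ordering are assumptions made for illustration; the actual on-disk layout written by AutoRound may differ.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) into 32-bit words, eight per word."""
    if len(values) % 8 != 0:
        raise ValueError("number of values must be a multiple of 8")
    packed = []
    for start in range(0, len(values), 8):
        word = 0
        for slot, value in enumerate(values[start:start + 8]):
            if not -8 <= value <= 7:
                raise ValueError(f"{value} does not fit in a signed 4-bit integer")
            # Keep the two's-complement low nibble and shift it into its slot.
            word |= (value & 0xF) << (4 * slot)
        packed.append(word)
    return packed


def unpack_int4(packed):
    """Recover the signed 4-bit integers from words produced by pack_int4."""
    values = []
    for word in packed:
        for slot in range(8):
            nibble = (word >> (4 * slot)) & 0xF
            # Nibbles 8..15 encode the negative values -8..-1.
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values
```

Packing 16 values this way yields 2 words, an eightfold reduction in element count relative to the logical weights.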

Intended Use

  • Primary Use Cases:
    • Local inference on consumer-grade hardware (e.g., desktops, laptops)
    • Specialized tasks requiring low-latency text generation
    • Research and experimentation in natural language processing
    • Agentic workflows with strong instruction following, tool use (e.g., web search, Python code execution), and reasoning capabilities
  • Supported Tasks:
    • Text generation
    • Instruction following
    • Chain-of-thought reasoning
    • Structured outputs
  • Intended Users:
    • Developers and researchers
    • Enterprises building AI applications
    • Hardware enthusiasts testing local inference performance

The model is suitable for scenarios requiring efficient deployment on resource-constrained devices, such as those with 16GB of memory. It supports a context window of up to 131,072 tokens, with a recommended minimum of 16,384 for reasoning tasks.
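
To see why a 16GB device is plausible, a rough weight-memory estimate helps. The sketch below is back-of-the-envelope arithmetic only: the helper name `estimate_weight_gib`, the 16-bit per-group scale, and the split between quantized and 16-bit parameters are all assumptions for illustration, not values read from the checkpoint. It also ignores the KV cache and activations, which grow with context length.

```python
def estimate_weight_gib(quantized_params, fp16_params,
                        bits=4, group_size=64, scale_bits=16):
    """Rough weight memory in GiB for a mixed INT4 / 16-bit checkpoint."""
    # Each group of `group_size` weights shares one scale, which adds
    # scale_bits / group_size bits of overhead per quantized weight.
    bits_per_quantized = bits + scale_bits / group_size
    total_bits = quantized_params * bits_per_quantized + fp16_params * 16
    return total_bits / 8 / 2**30


# Hypothetical split of ~20 billion parameters (assumed, not measured).
estimate = estimate_weight_gib(quantized_params=19e9, fp16_params=1e9)
print(f"Estimated weight memory: {estimate:.1f} GiB")
```

Under these assumptions the weights alone come to a little over 11 GiB, which leaves some headroom on a 16GB device for the cache and runtime overhead.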

How to Use

Inference with Transformers

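from transformers import pipeline

model_id = "Intel/gpt-oss-20b-int4-g64-rtn-AutoRound"

# Build a text-generation pipeline; device_map="auto" lets the library
# decide where to place the weights on the available hardware.
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

# Chat-style input: a list of role/content message dictionaries.
messages = [{"role": "user", "content": "Explain quantum mechanics clearly and concisely."}]
outputs = pipe(messages, max_new_tokens=512)

# generated_text holds the whole conversation; the last entry is the model's reply.
print(outputs[0]["generated_text"][-1])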
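
With chat-style input, the pipeline returns the whole conversation under `generated_text`, so the last element is a message dictionary rather than a plain string. A small helper, sketched here under the hypothetical name `last_assistant_text`, pulls out just the reply text. The `sample_outputs` value is hand-written to mirror that structure and is not real model output.

```python
def last_assistant_text(outputs):
    """Return the assistant's reply text from a text-generation pipeline result."""
    conversation = outputs[0]["generated_text"]
    # Plain-string prompts yield a string instead of a message list.
    if isinstance(conversation, str):
        return conversation
    last_message = conversation[-1]
    if last_message.get("role") != "assistant":
        raise ValueError("the last message is not an assistant reply")
    return last_message["content"]


# Hand-written stand-in for a pipeline result (not real model output).
sample_outputs = [{
    "generated_text": [
        {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
        {"role": "assistant", "content": "(reply text would appear here)"},
    ]
}]
print(last_assistant_text(sample_outputs))
```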

Hardware Requirements

  • Minimum: 16GB VRAM for local inference (the NVIDIA RTX 3090 used in the tests below has 24GB)
  • Recommended: Single 80GB GPU (e.g., NVIDIA H100, AMD MI300X) for optimal performance
  • Tested Platforms:
    • Windows 11: Up to 36,000-token context with 24GB VRAM (RTX 3090)
    • Linux: Up to 52,000-token context with 24GB VRAM (RTX 3090)
  • Performance (on RTX 3090, MXFP4 format):
    • Windows: ~24–36 tokens/second (t/s) generation at 2,000–36,000 token context
    • Linux: ~55–114 t/s generation at 2,000–50,000 token context

Linux setups typically offer better performance due to lower VRAM overhead.
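
To put the throughput figures in practical terms, the following sketch converts tokens per second into wall-clock generation time for the 512 new tokens requested in the usage example. The helper name `generation_seconds` is hypothetical, and the estimate covers generation only, not prompt processing.

```python
def generation_seconds(new_tokens, tokens_per_second):
    """Time to generate `new_tokens` at a steady decoding rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return new_tokens / tokens_per_second


# Generation-rate ranges (tokens/second) quoted above for the RTX 3090.
RATES = {"Windows": (24, 36), "Linux": (55, 114)}

for platform, (slowest, fastest) in RATES.items():
    best = generation_seconds(512, fastest)
    worst = generation_seconds(512, slowest)
    print(f"{platform}: {best:.1f} to {worst:.1f} s for 512 tokens")
```

At the quoted rates this works out to roughly 14 to 21 seconds on Windows and 4.5 to 9.3 seconds on Linux.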

Ethical Considerations and Limitations

  • Limitations:
    • The model may produce factually incorrect outputs and should not be relied upon for factual accuracy without verification.
    • Potential for generating biased, lewd, or offensive content due to limitations in the pretrained model and fine-tuning datasets.
    • Quantization may slightly degrade performance compared to the full-precision model.
  • Ethical Considerations:
    • Developers should perform safety testing before deployment to mitigate risks of harmful outputs.
    • Users should be informed of the model’s limitations and potential biases.
    • The model’s open-weight nature allows fine-tuning, which could be misused to bypass safety mechanisms.

Seek legal advice before using the model for commercial purposes.

Training and Quantization Details

  • Base Model: OpenAI/gpt-oss-20b, a Mixture-of-Experts model with 20 billion total parameters (~3.6 billion active per inference).
  • Quantization Method: Intel’s AutoRound with RTN (no algorithm tuning), using group size 64 and symmetric quantization for INT4 precision.
  • Weight Precision:
    • MoE projection weights: INT4 (group size 64, symmetric) in this checkpoint; the base model stores them in MXFP4 (4.25 bits per parameter)
    • Non-expert layers: BF16/F16 (16-bit)
  • Training Data: Not disclosed in available information.
  • Quantization Benefits: Reduces memory footprint, enabling deployment on systems with as little as 16GB of memory.
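
As an illustration of what symmetric Round-To-Nearest quantization with group size 64 means, here is a simplified sketch in plain Python. It is a generic textbook version, not AutoRound's implementation: the function names and the choice of scale (largest magnitude in the group mapped to 7) are assumptions, and AutoRound's own scale selection and storage format may differ.

```python
def rtn_quantize_sym_int4(weights, group_size=64):
    """Quantize a flat list of floats to INT4 codes with one scale per group."""
    if len(weights) % group_size != 0:
        raise ValueError("weight count must be a multiple of group_size")
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Symmetric: no zero-point, so the scale alone maps floats to integers.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        # Round to the nearest integer, then clamp to the signed 4-bit range.
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales


def dequantize(codes, scales, group_size=64):
    """Reconstruct approximate float weights from codes and per-group scales."""
    return [code * scales[i // group_size] for i, code in enumerate(codes)]
```

With this scheme, each reconstructed weight differs from the original by at most half of its group's scale, which is why smaller groups (and therefore more scales) generally reduce quantization error at the cost of extra storage.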

The model leverages Intel’s Neural Compressor for optimization. For more details, see Intel’s documentation.

Evaluation

  • Performance Metrics: The model has been tested for inference speed on consumer hardware (e.g., RTX 3090), showing competitive token generation rates (see Hardware Requirements).
  • Safety Evaluations: Based on OpenAI’s evaluations of gpt-oss-20b, the model does not reach high-risk capability thresholds in Biological, Chemical, Cyber, or AI Self-Improvement categories, even with adversarial fine-tuning.

Citation

@article{cheng2023optimize,
  title={Optimize weight rounding via signed gradient descent for the quantization of llms},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}