2imi9
/

gpt-oss-20B-NVFP4A16-BF16

@@ -52,11 +52,6 @@ model-index:
 This model is a quantized version of OpenAI's GPT-OSS-20B using NVIDIA's advanced NVFP4 format. It follows the official NVIDIA TensorRT Model Optimizer methodology, providing superior accuracy retention compared to MXFP4 quantization while maintaining significant memory efficiency gains.
-## Blog
-Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training:
-https://developer.nvidia.com/blog/fine-tuning-gpt-oss-for-accuracy-and-performance-with-quantization-aware-training/
 ## Key Features
 - **Advanced Quantization**: Uses NVFP4 format with FP8 E4M3 scaling for enhanced precision
@@ -207,6 +202,12 @@ This approach can achieve up to 98% task-specific performance recovery.
 - **Memory**: 24GB+ VRAM recommended
 - **CUDA**: 12.0+
 ### Framework Support Status
 - **TensorRT-LLM**: NVFP4 support in active development
@@ -216,10 +217,13 @@ This approach can achieve up to 98% task-specific performance recovery.
 ## Model Format Details
-- **Storage Format**: "Fake-quantized" NVFP4 (BF16 representation with quantization metadata)
-- **Deployment Format**: True NVFP4 when used with compatible inference engines
-- **File Format**: SafeTensors with quantization configuration
-- **Size**: ~39GB (fake-quantized), ~10GB (deployed NVFP4)
 ## Use Cases
@@ -235,6 +239,23 @@ This approach can achieve up to 98% task-specific performance recovery.
 - **Memory**: ~75% reduction in deployment memory requirements
 - **Compatibility**: Works with standard transformers, optimized for NVIDIA frameworks
 ## License
 This model inherits the Apache 2.0 license from the base openai/gpt-oss-20b model. Commercial use is permitted under the same terms.

 This model is a quantized version of OpenAI's GPT-OSS-20B using NVIDIA's advanced NVFP4 format. It follows the official NVIDIA TensorRT Model Optimizer methodology, providing superior accuracy retention compared to MXFP4 quantization while maintaining significant memory efficiency gains.
 ## Key Features
 - **Advanced Quantization**: Uses NVFP4 format with FP8 E4M3 scaling for enhanced precision
 - **Memory**: 24GB+ VRAM recommended
 - **CUDA**: 12.0+
+### Compatible Hardware (Software Emulation)
+- **RTX 4090**: Ada Lovelace architecture (no native NVFP4 acceleration)
+- **RTX 4080/4070**: Compatible via software emulation
+- **Data Center**: H100, A100 (software emulation)
+- **Memory**: 20GB+ VRAM for model loading
 ### Framework Support Status
 - **TensorRT-LLM**: NVFP4 support in active development
 ## Model Format Details
+- **Storage Format**: BF16 with NVFP4 quantization metadata
+- **File Size**: ~39GB (BF16 precision with quantization instructions)
+- **Deployment Format**: Runtime conversion to NVFP4 by compatible inference engines
+- **Deployed Size**: ~10GB when converted to 4-bit NVFP4 format
+- **File Format**: SafeTensors with embedded quantization configuration
+This model contains the full BF16 weights along with quantization parameters that enable inference engines like TensorRT-LLM to convert weights to true 4-bit NVFP4 format during model loading. The memory savings and performance benefits are realized at inference time, not during storage.
 ## Use Cases
 - **Memory**: ~75% reduction in deployment memory requirements
 - **Compatibility**: Works with standard transformers, optimized for NVIDIA frameworks
+## Limitations and Considerations
+- **Current State**: Model saved in fake-quantized format for compatibility
+- **Real Benefits**: Achieved only when deployed with NVFP4-compatible engines
+- **Hardware Dependency**: Optimal performance requires NVIDIA Blackwell architecture
+- **Framework Support**: Limited until inference engines implement NVFP4 support
+- **Model Size**: Large storage footprint until deployment conversion
+## Evaluation and Benchmarking
+This model maintains the capabilities of the original GPT-OSS-20B while providing memory efficiency benefits. For comprehensive evaluation, test against:
+- **Language Modeling**: Perplexity on standard datasets
+- **Downstream Tasks**: Task-specific accuracy measurements
+- **Generation Quality**: Human evaluation of output coherence
+- **Memory Usage**: Deployment memory requirements vs. accuracy trade-offs
 ## License
 This model inherits the Apache 2.0 license from the base openai/gpt-oss-20b model. Commercial use is permitted under the same terms.