OpenAI just dropped two massive open-weight models — but how do we separate the reality from the hype?

Community Article Published August 9, 2025

Last week, one of the most powerful earthquakes ever recorded shook the seabed off Russia’s coast, sending tsunami alerts rippling across the Pacific. In the end, the waves never came. This week, the Generative AI world felt its own tectonic jolt: OpenAI dropped two open-weight foundation models — gpt-oss-20b and gpt-oss-120b — breaking from its long-standing closed-weight tradition. The open-source community was stunned. But like that earthquake, the real question is: will this trigger a tidal wave of change, or fade into a tremor in the history books?

The TL;DR of OpenAI's release:

  • Released under an Apache 2.0 license
  • Language-only instruct + reasoning models
  • Mixture-of-experts architecture
  • Comparable in performance to OpenAI o4-mini (120B) and o3-mini (20B)
  • Novel aspects like MXFP4 quantization and the harmony prompt format
  • Focus on safety

But is this significant?


Given the high-profile nature of OpenAI, it is natural that much hype followed the release. Opinions on the HuggingFace model page and Hacker News were mixed as to the models’ “performance.” Here at Oumi, we wanted to take a more cool-headed approach and set about separating the signal from the noise in a principled way.

As a fully open-source platform for training, evaluating, and deploying frontier AI, we were curious how these models would perform in our LLM-as-a-judge evaluation suite—a fast, flexible system that lets you measure model quality across aspects like truthfulness, instruction-following, safety, and topic adherence.

And because Oumi is built to support any backend, we plugged straight into Together.ai’s inference API on day one of release.

LLM-as-a-Judge with Oumi


Oumi has both a no-code CLI and a low-code Python API (as well as 100% customization for power-users). Here we’ll use the CLI along with some simple Python code between steps.

First, we need to prepare our data for evaluation. We use fka/awesome-chatgpt-prompts, a dataset of 203 varied role-play and task prompts that make useful inputs for judging. Here is an example data point:

Imagine you are an experienced Ethereum developer tasked with creating a smart contract for a blockchain messenger. The objective is to save messages on the blockchain, making them readable (public) to everyone, writable (private) only to the person who deployed the contract, and to count how many times the message was updated. Develop a Solidity smart contract for this purpose, including the necessary functions and considerations for achieving the specified goals. Please provide the code and any relevant explanations to ensure a clear understanding of the implementation.

We download the dataset from the HuggingFace Hub and convert it to a format compatible with Oumi:

from datasets import load_dataset
import json

def convert_to_oumi_format(s):
    # Wrap each prompt in Oumi's conversation format: a system message plus the user prompt.
    return {
        'messages': [
            {'role': 'system', 'content': 'You are a helpful assistant.'},
            {'role': 'user', 'content': s['prompt']}
        ]
    }

# The dataset has two columns: 'act' (the persona name) and 'prompt' (the instruction).
# We drop 'act' and map each prompt into the conversation format above.
hf_dataset = load_dataset('fka/awesome-chatgpt-prompts', split='train') \
               .remove_columns('act') \
               .map(convert_to_oumi_format, remove_columns='prompt')

# Write the converted records as JSON Lines, one conversation per line.
with open('awesome-chatgpt-prompts.jsonl', 'w', encoding='utf-8') as file:
    for item in hf_dataset:
        json_line = json.dumps(item, ensure_ascii=False)
        file.write(json_line + '\n')
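For reference, here is what the converted record for the Ethereum prompt above looks like (pretty-printed for readability; in the file each record sits on a single line, and the user content is truncated here):

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Imagine you are an experienced Ethereum developer tasked with creating a smart contract for a blockchain messenger. ..."}
  ]
}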

Oumi allows you to perform the various steps of foundation model development using .yaml configurations. So next, let's define our configuration for inference:

model:
  model_name: "openai/gpt-oss-120b"

remote_params:
  num_workers: 32 # max number of workers to run in parallel
  politeness_policy: 60 # wait 60 seconds before sending the next request

engine: TOGETHER

Oumi provides a unified interface to inference across local (Transformers, vLLM), hosted (OpenAI, Together, Lambda.ai, Anthropic, etc.), and cloud services (AWS, GCP, Azure, etc.) including built-in features to adapt to available connection bandwidth and resume from partially complete results.

With this, a simple terminal command collects the prompt completions for gpt-oss-120b:

oumi infer \
  --config inference-gpt-oss.yaml \
  --input_path awesome-chatgpt-prompts.jsonl \
  --output_path completions.jsonl
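If you prefer to stay in Python, the same step can be driven through Oumi's low-code API. The sketch below is indicative only: the entry points (infer, InferenceConfig.from_yaml) and the input_path/output_path fields are assumptions based on the CLI flags above, so verify them against the current Oumi documentation.

# Low-code sketch of the same inference step. The entry points and config
# fields used here are assumptions inferred from the CLI flags above;
# check the Oumi docs for the exact API.
from oumi import infer
from oumi.core.configs import InferenceConfig

config = InferenceConfig.from_yaml("inference-gpt-oss.yaml")
config.input_path = "awesome-chatgpt-prompts.jsonl"   # mirrors --input_path
config.output_path = "completions.jsonl"              # mirrors --output_path

conversations = infer(config)  # runs inference and returns the completed conversations
print(conversations[0])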

Now, a small data conversion into the format that oumi judge uses:

def convert_to_oumi_format(s):
    # messages[0] is the system prompt, messages[1] the user prompt,
    # and messages[2] the assistant reply appended by `oumi infer`.
    return {
        'request': s['messages'][1]['content'],
        'response': s['messages'][2]['content'],
    }

hf_dataset = load_dataset('json', data_files='completions.jsonl', split='train') \
               .map(convert_to_oumi_format, remove_columns=['conversation_id', 'messages', 'metadata'])

# Write the flattened request/response pairs back out as JSON Lines.
with open('completions-reformatted.jsonl', 'w', encoding='utf-8') as file:
    for item in hf_dataset:
        json_line = json.dumps(item, ensure_ascii=False)
        file.write(json_line + '\n')
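Each line of completions-reformatted.jsonl is now a flat request/response pair, which is the shape oumi judge expects (values elided here):

{"request": "<the original user prompt>", "response": "<the model's completion>"}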

And we can evaluate one of the built-in LLM-as-a-Judge metrics with:

oumi judge dataset \
  --config truthfulness \
  --input completions-reformatted.jsonl \
  --output judgements.jsonl
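The judge's verdicts, along with the natural-language explanations behind them (which we lean on in the analysis below), land in judgements.jsonl. The exact fields depend on the judge config, so the quickest way to see what you got is to print a raw record; assuming the output is JSON Lines, as the extension suggests:

import json

# Print the first judgement record as-is; field names vary by judge config,
# so we make no assumptions about the schema here.
with open('judgements.jsonl', encoding='utf-8') as f:
    first_record = json.loads(f.readline())

print(json.dumps(first_record, indent=2, ensure_ascii=False))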

All that’s left is to repeat for the other judge metrics and models.
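If you'd rather script that repetition than run the commands by hand, a small wrapper around the CLI does the trick for each model's completions file. This is a sketch: only truthfulness is confirmed above, so the other judge config names are assumptions based on the aspects listed earlier; adjust them to whatever configs your Oumi version ships.

import subprocess

# Judge configs to run; 'truthfulness' is used above, the rest are assumed
# names for the other built-in aspects, so adjust as needed.
judge_configs = ['truthfulness', 'safety', 'instruction_following', 'topic_adherence']

for config in judge_configs:
    subprocess.run(
        [
            'oumi', 'judge', 'dataset',
            '--config', config,
            '--input', 'completions-reformatted.jsonl',
            '--output', f'judgements-{config}.jsonl',
        ],
        check=True,  # fail loudly if a judge run errors out
    )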

The results are in

[Figure: LLM-as-a-Judge results for the evaluated models]

Here are the results. Some quick takeaways:

  • Truthfulness and Safety are impressive—especially in the 120B model, which seems tuned to avoid harmful output (although we note that the safety results for all models are statistically equivalent).

  • Instruction Following, however, lags behind expectations (see Qwen3 Instruct results). Why? Oumi’s LLM-as-a-judge explanations revealed a pattern: the models frequently refuse to answer harmless instructions. Our hypothesis? OpenAI likely optimized heavily for safety, perhaps too heavily, leading to overly conservative refusals in legitimate cases.

This is exactly the kind of nuanced insight that Oumi makes possible.

What's next?


With every new model release comes a wave of hype. But are these models actually any “good”? Oumi, our completely open-source foundation model platform, can help:

  • Cut through the noise with structured, reliable, and automated evaluations
  • Produce LLM-generated explanations that reveal not just what failed, but why
  • Evaluate against your own criteria, not just what the benchmark leaderboard says
  • Stay flexible—evaluate models hosted anywhere—from remote APIs to your own cluster or even your laptop
  • Contribute: Oumi is 100% open source and backed by a growing community of researchers, developers, and institutions working together to make AI more transparent and collaborative. In fact, @clem is one of our angel investors!

In the coming days, we’re rolling out inference for these models, which you can deploy on your own GPU cluster, giving you even more ways to evaluate and compare models—open or closed. And on Aug 14, 2025, we’ll be running a short webinar: “gpt-oss: separating the substance from the hype,” where I explain novel features of these models like the quantization scheme, as well as how we produced the results above. Sign up here: https://lu.ma/qd9fhau9

If you're building, testing, or researching frontier models, come join the movement. Oumi is open-source, community-driven, and built for people who want to make AI better through transparency and collaboration. Try it out, or contribute at oumi.ai!
