Model Card for MiniModel-200M-Base

This model card describes MiniModel-200M-Base, a 200M-parameter decoder-only transformer trained with state-of-the-art techniques for maximum data and compute efficiency.

Model Details

Model Description

  • Developed by: xTimeCrystal
  • Model type: Softmax self-attention decoder-only transformer
  • Languages: English, Chinese, Python
  • License: Apache 2.0

This model leverages cutting-edge training techniques to achieve strong performance with only 10B tokens of training data, trained in just one day on a single RTX 5090 GPU. As demonstrated below, it handles diverse tasks—from factual recall to coherent article generation—despite its small size.

Key innovations include:

  • Adaptive Muon optimizer: Based on the Muon optimizer, it delivers 2.1× the data efficiency of AdamW. Momentum buffers are stored in bfloat16, further reducing VRAM usage.
  • Aggressive data filtering: A curated selection of high-quality educational content enhances performance in resource-constrained settings.
  • Efficient data bin-packing: To minimize padding waste (originally >70%), sequences were concatenated via a bin-packing algorithm to reach near-full 2048-token lengths, reducing padding to <5% (a minimal sketch of this packing step appears below, after the loss curve).
  • Float8 pretraining: Training used bfloat16 master weights, fp8 (e4m3) casting with bfloat16 accumulation, and full bfloat16 backward passes. The attention mechanism was kept in bfloat16 to avoid loss degradation. This setup matches full bfloat16 performance while cutting VRAM usage by ~30% and boosting throughput by ~20%.
  • ReLU² activation: This ultra-sparse activation outperforms SwiGLU (1, 2) while requiring only two matrix multiplications, marginally improving VRAM usage.
  • Full attention: All layers use standard softmax attention (no sliding window or grouped-query attention), preserving capacity in a small model.
  • QK Norm without scalars: Removing learnable scalars improved training stability by preventing loss spikes and excessive attention activations (this and the ReLU² block are sketched just after this list).
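
The actual layer definitions live in the repository's model.py; the snippet below is only a minimal PyTorch sketch of two of the ideas above, a ReLU² feed-forward block and scalar-free QK normalization, with hypothetical class and function names that are not taken from the repository:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ReLU2MLP(nn.Module):
    """Feed-forward block with the ReLU^2 activation: two matmuls, no gating."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.up = nn.Linear(dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.relu(self.up(x)) ** 2)

def qk_norm(q: torch.Tensor, k: torch.Tensor, eps: float = 1e-6):
    """Scalar-free QK norm: unit-normalize queries and keys along the head
    dimension, with no learnable gain, before softmax attention."""
    q = q / (q.norm(dim=-1, keepdim=True) + eps)
    k = k / (k.norm(dim=-1, keepdim=True) + eps)
    return q, k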

These optimizations enabled 110k steps of training with a batch size of 64 × 2048 tokens, without gradient accumulation, while staying under 30 GB of VRAM and remaining completely free of loss spikes:

Training loss curve
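
The packing code itself is not included in this card; the following is only a minimal first-fit sketch of the bin-packing idea referenced in the list above (the pack_sequences helper is hypothetical):

def pack_sequences(tokenized_docs, max_len=2048):
    """Greedy first-fit packing: concatenate tokenized documents into bins of
    at most max_len tokens so that almost no padding is required."""
    bins = []
    for seq in sorted(tokenized_docs, key=len, reverse=True):
        seq = seq[:max_len]                      # truncate over-long documents
        for b in bins:
            if len(b) + len(seq) <= max_len:     # first bin with enough room
                b.extend(seq)
                break
        else:                                    # no bin fits: open a new one
            bins.append(list(seq))
    return bins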

Intended Uses

This model is designed for efficient inference and experimentation in low-resource environments. It is suitable for educational use, prototyping, and applications where model size and speed are critical. Users include researchers, developers, and hobbyists working with constrained hardware.

Getting Started

Download all files from the repository into a single folder and run the notebook cells.
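
Alternatively, assuming the huggingface_hub package is installed, the files can be fetched programmatically with snapshot_download:

from huggingface_hub import snapshot_download

# Downloads config.json, model.safetensors, model.py, and the tokenizer files
# into the current folder so the notebook cells can find them.
snapshot_download(repo_id="xTimeCrystal/MiniModel-200M-Base", local_dir="./")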

Loading the Model

import json
import torch
from safetensors import safe_open
from model import Transformer as Model          # model.py from this repository
from transformers import PreTrainedTokenizerFast

# Load the model hyperparameters shipped alongside the checkpoint.
with open("./config.json", "r") as f:
    config = json.load(f)

# Create new tensors on the GPU when one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
torch.set_default_device(device)

# Instantiate the model and cast it to bfloat16 to match the checkpoint.
model = Model(**config)
model.zero_grad()
model.bfloat16()

# Read the weights from the safetensors checkpoint and load them.
saved_states = {}
with safe_open("./model.safetensors", framework="pt", device=device) as f:
    for key in f.keys():
        saved_states[key] = f.get_tensor(key)
model.load_state_dict(saved_states)
model.eval()

# The tokenizer files live in the same folder as the weights.
tokenizer = PreTrainedTokenizerFast.from_pretrained("./")

Example: Fibonacci Generation

# Prompt the model with the start of a Python function.
tokens = tokenizer('''def fibonacci(n: int):''')['input_ids']
current = tokenizer.decode(tokens)
print(current, end="")

temperature = 1e-4  # near-zero temperature makes sampling effectively greedy
for _ in range(128):
    tok = torch.tensor(tokens).reshape(1, -1)   # shape (1, seq_len)
    with torch.no_grad():                       # inference only, no autograd graph
        logits = model(tok)
    # Sample the next token from the temperature-scaled distribution.
    nxt = torch.multinomial(
        torch.softmax(logits[:, -1].float() / temperature, dim=-1).squeeze(),
        num_samples=1
    ).item()
    tokens += [nxt]
    # Print only the newly generated text.
    print(tokenizer.decode(tokens).replace(current, "", 1), end="")
    current = tokenizer.decode(tokens)

Output:

<s> def fibonacci(n: int):
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

def fibonacci_recursive(n: int):
    if n < 2:
        return n
    return fibonacci_recursive(n - 1) + fibonacci_recursive(n - 2)

def fibonacci_iterative(n: int):
    if n < 2:
        return n
    return fibonacci_iterative

Additional Examples

  • Digits of π (temperature=0.0001):
    Correctly recites the first 20 digits: 3.14159265358979323846...

  • “The purpose of life” (temperature=0.8):
    Produces a coherent, skill-focused philosophical reflection:

    <s> The purpose of life is to build up the body’s strength, endurance, and energy reserves through the accumulation of acquired skills, and to get rid of worn or damaged parts of the body. All of this depends on day’s activities and deeds. The process of building up the body and taking on new challenges, such as accumulating health, will require the use of skills and abilities.
    The main purpose of building up skills and abilities in life is to make new people capable of doing the things that they need to do. This process requires you to develop skills that are applicable to everyday life. Skills can either be formal, or in the
    

Additional examples can be found in the Jupyter notebook file.

Tip: Increase temperature to reduce repetition and encourage creativity. Temperature = 0.8 is recommended for general use, while temperature = 0.0001 is recommended for factual recall.
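
In the sampling loop above, this only changes the temperature constant; for example:

temperature = 0.8     # general use: varied, less repetitive text
# temperature = 1e-4  # factual recall: effectively greedy decoding
probs = torch.softmax(logits[:, -1].float() / temperature, dim=-1).squeeze()
nxt = torch.multinomial(probs, num_samples=1).item()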

Bias, Risks, and Limitations

Despite strong performance in many areas, this 200M-parameter model is not infallible. For instance, when prompted with “The radius of the Earth”, it outputs:

<s> The radius of the Earth is a measure of almost exactly 375,000 miles.
Scientists have long wondered what the planet was like long ago. Because of how old the Earth is—that is, the oldest part of it—we know that the Earth’s radius is about 670,000 miles. ...

This is off by roughly two orders of magnitude (actual mean radius: ~3,959 miles). Users should verify all factual claims and avoid relying on the model for high-stakes decisions.

Citation

If you use this model in research, please cite:

@misc{timecrystal200m2025,
  title={MiniModel-200M-Base: SOTA Efficiency for Small Language Models},
  author={xTimeCrystal},
  year={2025},
  howpublished={\url{https://huggingface.co/xTimeCrystal/MiniModel-200M-Base}},
}