A0l-8B-PRUNE

Pruned version of schneewolflabs/A0l-12B

This is a depth-pruned variant of the A0l-12B model, reduced from 12.25B to 8.70B parameters using structured layer removal.

Model Details

  • Base Model: schneewolflabs/A0l-12B
  • Architecture: Mistral-Nemo derivative
  • Pruning Method: Depth pruning (layer removal)
  • Original Parameters: 12.25B
  • Pruned Parameters: 8.70B
  • Reduction: 28.9%
  • Layers: 40 → 27 (removed 13 middle layers)

Pruning Details

What Changed

  • Removed: 13 transformer layers (layers 13-25); see the reproduction sketch after this list
  • Kept: Early layers (feature extraction) + late layers (task-specific)
  • Preserved:
    • ✅ Vocabulary size (128k tokens)
    • ✅ Hidden dimensions (5120)
    • ✅ FFN dimensions (14336)
    • ✅ Attention structure (32 heads, 8 KV heads)
    • ✅ SwiGLU activation
    • ✅ Same tokenizer
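
Depth pruning of this kind can be reproduced with plain transformers and PyTorch. The sketch below follows the layer indices listed above; it is illustrative rather than the exact script used to produce this checkpoint, and the output directory name is a placeholder.

# Sketch: drop transformer blocks 13-25 from the base model and save the result.
# Illustrative only; loading the 12B model this way needs roughly 25 GB of CPU RAM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "schneewolflabs/A0l-12B"
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base)

remove = set(range(13, 26))  # layer indices 13-25 inclusive (13 blocks)
model.model.layers = torch.nn.ModuleList(
    layer for i, layer in enumerate(model.model.layers) if i not in remove
)
model.config.num_hidden_layers = len(model.model.layers)  # 40 -> 27

# Depending on the transformers version, the per-layer index used for KV-cache
# bookkeeping should be made contiguous again after the removal.
for new_idx, layer in enumerate(model.model.layers):
    layer.self_attn.layer_idx = new_idx

model.save_pretrained("A0l-8B-PRUNE")
tokenizer.save_pretrained("A0l-8B-PRUNE")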

What's Maintained

This model maintains full compatibility with the Mistral-Nemo architecture (see the config check after this list):

  • Same vocabulary and tokenizer
  • Same hidden size and FFN dimensions
  • Grouped Query Attention (GQA) with 4:1 ratio
  • Rotary Position Embeddings (RoPE) with theta=1M
  • BFloat16 precision
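
The preserved settings above can be read directly from the released config; the snippet below prints the standard Mistral config fields together with the values expected for this model.

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("nbeerbower/A0l-8B-PRUNE")
print(cfg.num_hidden_layers)     # 27 (reduced from 40)
print(cfg.hidden_size)           # 5120
print(cfg.intermediate_size)     # 14336
print(cfg.num_attention_heads)   # 32
print(cfg.num_key_value_heads)   # 8 -> 4:1 GQA ratio
print(cfg.rope_theta)            # 1000000.0
print(cfg.vocab_size)            # 131072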

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "nbeerbower/A0l-8B-PRUNE",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("nbeerbower/A0l-8B-PRUNE")

# Generate text
prompt = "The future of AI is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,
    do_sample=True
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Performance

Model Size

  • Original: ~24GB (12.25B parameters)
  • Pruned: ~17GB (8.70B parameters)
  • Savings: ~29% smaller (see the arithmetic check below)
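
These sizes follow from the parameter counts at 2 bytes per BF16 weight:

# Rough checkpoint size: parameter count x 2 bytes per bfloat16 weight.
for label, params in [("original", 12_247_782_400), ("pruned", 8_703_462_400)]:
    print(f"{label}: {params * 2 / 1e9:.1f} GB")
# original: 24.5 GB
# pruned: 17.4 GB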

Expected Benefits

  • Inference Speed: roughly 1.4x faster (27 of 40 layers remain, so per-token compute drops nearly in proportion)
  • Memory: ~29% less VRAM required
  • Compatibility: Works with standard Hugging Face tooling (Transformers, vLLM, TGI, etc.)

Quality Considerations

⚠️ Important: This model was pruned without knowledge distillation or retraining. Output quality is degraded compared to the original A0l-12B:

  • Text may be less coherent
  • May produce grammatical errors or artifacts
  • Suitable for applications where speed/size matter more than perfect quality

For best results: Consider this a "base" pruned model that should be fine-tuned or distilled on your target task/dataset.

Recommended Use Cases

Good for:

  • Resource-constrained deployments
  • Experimentation and research
  • Base model for further fine-tuning
  • Applications where speed > quality

Not recommended for:

  • Production chatbots without further training
  • High-stakes text generation
  • Tasks requiring perfect coherence

How to Improve Quality

  1. Knowledge Distillation: Use the original A0l-12B as teacher and train for 1-2 epochs (see the sketch below)
  2. Fine-tuning: Train on your specific task/domain
  3. Try Conservative Pruning: Remove fewer layers (8 instead of 13)
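
A minimal sketch of option 1, logit-level distillation with the original A0l-12B as teacher. The temperature, learning rate, and batching are placeholders rather than validated settings; a real run would also mask padding tokens, mix in the standard language-modeling loss, and use memory-saving tooling (LoRA, DeepSpeed, etc.) to fit both models plus optimizer state.

# Sketch: one knowledge-distillation step (KL between softened teacher/student logits).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher = AutoModelForCausalLM.from_pretrained(
    "schneewolflabs/A0l-12B", torch_dtype=torch.bfloat16, device_map="auto"
).eval()
student = AutoModelForCausalLM.from_pretrained(
    "nbeerbower/A0l-8B-PRUNE", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("nbeerbower/A0l-8B-PRUNE")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
T = 2.0  # softening temperature (placeholder)

def distill_step(texts):
    batch = tokenizer(texts, return_tensors="pt", padding=True,
                      truncation=True, max_length=1024).to(student.device)
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits
    student_logits = student(**batch).logits

    # KL divergence between softened teacher and student token distributions
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()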

Technical Details

Pruning Configuration

{
  "original_model": "schneewolflabs/A0l-12B",
  "original_params": 12247782400,
  "pruned_params": 8703462400,
  "reduction_percent": 28.93,
  "pruning_type": "depth",
  "layers_removed": 13,
  "removed_layer_indices": "13-25"
}
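
The counts above can be reproduced from the released weights (assuming enough RAM to load the checkpoint in BF16):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "nbeerbower/A0l-8B-PRUNE", torch_dtype=torch.bfloat16
)
print(sum(p.numel() for p in model.parameters()))  # expected: 8,703,462,400
print(model.config.num_hidden_layers)              # expected: 27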

Architecture Comparison

Component          Original   Pruned    Status
Layers             40         27        ⚠️ Changed
Hidden Size        5120       5120      ✅ Same
Intermediate Size  14336      14336     ✅ Same
Attention Heads    32         32        ✅ Same
KV Heads           8          8         ✅ Same
Vocab Size         131072     131072    ✅ Same

Pruning Methodology

This model was pruned using structured depth pruning:

  1. Identified redundant middle layers (13-25); see the similarity sketch below
  2. Removed entire transformer blocks
  3. Preserved architectural integrity
  4. No retraining applied (zero-shot pruning)

Tools used: Torch-Pruning
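
Step 1 is often approximated by measuring how little each block changes its hidden states on a small calibration set; blocks whose input and output are nearly identical are the usual removal candidates. The cosine-similarity sketch below illustrates that general heuristic; it is not the exact selection procedure used for this checkpoint, and the calibration texts are placeholders.

# Sketch: rank blocks of the original model by input/output hidden-state similarity.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "schneewolflabs/A0l-12B"
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
tokenizer = AutoTokenizer.from_pretrained(name)

calibration = ["The future of AI is", "Depth pruning removes entire transformer blocks."]
scores = torch.zeros(model.config.num_hidden_layers)

with torch.no_grad():
    for text in calibration:
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        # hidden_states has num_hidden_layers + 1 entries (embeddings, then each block)
        hidden = model(**inputs, output_hidden_states=True).hidden_states
        for i in range(model.config.num_hidden_layers):
            sim = F.cosine_similarity(hidden[i], hidden[i + 1], dim=-1).mean()
            scores[i] += sim.item() / len(calibration)

# The highest-similarity blocks are the candidates for removal (13 here).
for i in scores.argsort(descending=True)[:13].tolist():
    print(f"layer {i}: mean cosine similarity {scores[i].item():.4f}")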

Limitations

  • Output quality degraded vs. original (no distillation applied)
  • May produce incoherent text on complex prompts
  • Not suitable for production without further training
  • Intended as a research/development artifact

Citation

If you use this model, please cite the original A0l-12B:

Original model: schneewolflabs/A0l-12B
Base architecture: Mistral-Nemo-12B
Pruning method: Depth pruning (layer removal)

License

Follows the same license as the original A0l-12B model.
