A0l-8B-PRUNE

Pruned version of schneewolflabs/A0l-12B

This is a depth-pruned variant of the A0l-12B model, reduced from 12.25B to 8.70B parameters using structured layer removal.

Model Details

  • Base Model: schneewolflabs/A0l-12B
  • Architecture: Mistral-Nemo derivative
  • Pruning Method: Depth pruning (layer removal)
  • Original Parameters: 12.25B
  • Pruned Parameters: 8.70B
  • Reduction: 28.9%
  • Layers: 40 → 27 (removed 13 middle layers)

Pruning Details

What Changed

  • Removed: 13 transformer layers (layers 13-25); see the reproduction sketch after this list
  • Kept: Early layers (feature extraction) + late layers (task-specific)
  • Preserved:
    • ✅ Vocabulary size (128k tokens)
    • ✅ Hidden dimensions (5120)
    • ✅ FFN dimensions (14336)
    • ✅ Attention structure (32 heads, 8 KV heads)
    • ✅ SwiGLU activation
    • ✅ Same tokenizer
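
Depth pruning of this kind can be reproduced with plain transformers and PyTorch. The sketch below follows the layer indices listed above; it is illustrative rather than the exact script used to produce this checkpoint, and the output directory name is a placeholder.

# Sketch: drop transformer blocks 13-25 from the base model and save the result.
# Illustrative only; loading the 12B model this way needs roughly 25 GB of CPU RAM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "schneewolflabs/A0l-12B"
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base)

remove = set(range(13, 26))  # layer indices 13-25 inclusive (13 blocks)
model.model.layers = torch.nn.ModuleList(
    layer for i, layer in enumerate(model.model.layers) if i not in remove
)
model.config.num_hidden_layers = len(model.model.layers)  # 40 -> 27

# Depending on the transformers version, the per-layer index used for KV-cache
# bookkeeping should be made contiguous again after the removal.
for new_idx, layer in enumerate(model.model.layers):
    layer.self_attn.layer_idx = new_idx

model.save_pretrained("A0l-8B-PRUNE")
tokenizer.save_pretrained("A0l-8B-PRUNE")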

What's Maintained

This model maintains full compatibility with the Mistral-Nemo architecture (see the config check after this list):

  • Same vocabulary and tokenizer
  • Same hidden size and FFN dimensions
  • Grouped Query Attention (GQA) with 4:1 ratio
  • Rotary Position Embeddings (RoPE) with theta=1M
  • BFloat16 precision
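
The preserved settings above can be read directly from the released config; the snippet below prints the standard Mistral config fields together with the values expected for this model.

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("nbeerbower/A0l-8B-PRUNE")
print(cfg.num_hidden_layers)     # 27 (reduced from 40)
print(cfg.hidden_size)           # 5120
print(cfg.intermediate_size)     # 14336
print(cfg.num_attention_heads)   # 32
print(cfg.num_key_value_heads)   # 8 -> 4:1 GQA ratio
print(cfg.rope_theta)            # 1000000.0
print(cfg.vocab_size)            # 131072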

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "nbeerbower/A0l-8B-PRUNE",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("nbeerbower/A0l-8B-PRUNE")

# Generate text
prompt = "The future of AI is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,
    do_sample=True
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Performance

Model Size

  • Original: ~24GB (12.25B parameters)
  • Pruned: ~17GB (8.70B parameters)
  • Savings: ~29% smaller (see the arithmetic check below)
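
These sizes follow from the parameter counts at 2 bytes per BF16 weight:

# Rough checkpoint size: parameter count x 2 bytes per bfloat16 weight.
for label, params in [("original", 12_247_782_400), ("pruned", 8_703_462_400)]:
    print(f"{label}: {params * 2 / 1e9:.1f} GB")
# original: 24.5 GB
# pruned: 17.4 GB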

Expected Benefits

  • Inference Speed: roughly 1.4x faster (27 of 40 layers remain, so per-token compute drops nearly in proportion)
  • Memory: ~29% less VRAM required
  • Compatibility: Works with standard Hugging Face tooling (Transformers, vLLM, TGI, etc.)

Quality Considerations

⚠️ Important: This model was pruned without knowledge distillation or retraining. Output quality is degraded compared to the original A0l-12B:

  • Text may be less coherent
  • May produce grammatical errors or artifacts
  • Suitable for applications where speed/size matter more than perfect quality

For best results: Consider this a "base" pruned model that should be fine-tuned or distilled on your target task/dataset.

Recommended Use Cases

Good for:

  • Resource-constrained deployments
  • Experimentation and research
  • Base model for further fine-tuning
  • Applications where speed > quality

Not recommended for:

  • Production chatbots without further training
  • High-stakes text generation
  • Tasks requiring perfect coherence

How to Improve Quality

  1. Knowledge Distillation: Use the original A0l-12B as teacher and train for 1-2 epochs (see the sketch below)
  2. Fine-tuning: Train on your specific task/domain
  3. Try Conservative Pruning: Remove fewer layers (8 instead of 13)
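
A minimal sketch of option 1, logit-level distillation with the original A0l-12B as teacher. The temperature, learning rate, and batching are placeholders rather than validated settings; a real run would also mask padding tokens, mix in the standard language-modeling loss, and use memory-saving tooling (LoRA, DeepSpeed, etc.) to fit both models plus optimizer state.

# Sketch: one knowledge-distillation step (KL between softened teacher/student logits).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher = AutoModelForCausalLM.from_pretrained(
    "schneewolflabs/A0l-12B", torch_dtype=torch.bfloat16, device_map="auto"
).eval()
student = AutoModelForCausalLM.from_pretrained(
    "nbeerbower/A0l-8B-PRUNE", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("nbeerbower/A0l-8B-PRUNE")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
T = 2.0  # softening temperature (placeholder)

def distill_step(texts):
    batch = tokenizer(texts, return_tensors="pt", padding=True,
                      truncation=True, max_length=1024).to(student.device)
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits
    student_logits = student(**batch).logits

    # KL divergence between softened teacher and student token distributions
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()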

Technical Details

Pruning Configuration

{
  "original_model": "schneewolflabs/A0l-12B",
  "original_params": 12247782400,
  "pruned_params": 8703462400,
  "reduction_percent": 28.93,
  "pruning_type": "depth",
  "layers_removed": 13,
  "removed_layer_indices": "13-25"
}
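
The counts above can be reproduced from the released weights (assuming enough RAM to load the checkpoint in BF16):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "nbeerbower/A0l-8B-PRUNE", torch_dtype=torch.bfloat16
)
print(sum(p.numel() for p in model.parameters()))  # expected: 8,703,462,400
print(model.config.num_hidden_layers)              # expected: 27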

Architecture Comparison

Component          Original   Pruned    Status
Layers             40         27        ⚠️ Changed
Hidden Size        5120       5120      ✅ Same
Intermediate Size  14336      14336     ✅ Same
Attention Heads    32         32        ✅ Same
KV Heads           8          8         ✅ Same
Vocab Size         131072     131072    ✅ Same

Pruning Methodology

This model was pruned using structured depth pruning:

  1. Identified redundant middle layers (13-25); see the similarity sketch below
  2. Removed entire transformer blocks
  3. Preserved architectural integrity
  4. No retraining applied (zero-shot pruning)

Tools used: Torch-Pruning
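
Step 1 is often approximated by measuring how little each block changes its hidden states on a small calibration set; blocks whose input and output are nearly identical are the usual removal candidates. The cosine-similarity sketch below illustrates that general heuristic; it is not the exact selection procedure used for this checkpoint, and the calibration texts are placeholders.

# Sketch: rank blocks of the original model by input/output hidden-state similarity.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "schneewolflabs/A0l-12B"
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
tokenizer = AutoTokenizer.from_pretrained(name)

calibration = ["The future of AI is", "Depth pruning removes entire transformer blocks."]
scores = torch.zeros(model.config.num_hidden_layers)

with torch.no_grad():
    for text in calibration:
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        # hidden_states has num_hidden_layers + 1 entries (embeddings, then each block)
        hidden = model(**inputs, output_hidden_states=True).hidden_states
        for i in range(model.config.num_hidden_layers):
            sim = F.cosine_similarity(hidden[i], hidden[i + 1], dim=-1).mean()
            scores[i] += sim.item() / len(calibration)

# The highest-similarity blocks are the candidates for removal (13 here).
for i in scores.argsort(descending=True)[:13].tolist():
    print(f"layer {i}: mean cosine similarity {scores[i].item():.4f}")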

Limitations

  • Output quality degraded vs. original (no distillation applied)
  • May produce incoherent text on complex prompts
  • Not suitable for production without further training
  • Intended as a research/development artifact

Citation

If you use this model, please cite the original A0l-12B:

Original model: schneewolflabs/A0l-12B
Base architecture: Mistral-Nemo-12B
Pruning method: Depth pruning (layer removal)

License

Follows the same license as the original A0l-12B model.
