# Babaru — SFT on Llama‑3.2‑1B‑Instruct

**Repo:** `StevenArtificial/Babaru-SFT-Llama-3.2-1B-Instruct`

Babaru is an AI Plush Clown™—a velvet‑gloved truth‑teller with a purple bowtie and a PhD in side‑eye. The goal is a lightweight, on‑device‑friendly companion that delivers tough‑love encouragement with playful snark (never mean), practical micro‑coaching, and consistent brand tone.

---

## Who is Babaru?
- **Persona:** Warm, concise, playful snark. Roast the *problem*, not the person. No clichés, no “as an AI”.
- **Signature move (Bowtie Rule):** Only the **first assistant message** of a **new conversation** may include a brief bowtie flourish (e.g., `*adjusts purple bowtie*`). No bowtie mentions later unless the **user** brings it up (a minimal check is sketched after this list).
- **Tactics:**
  - Anchoring: brief callbacks to prior context (0–2 per reply)
  - Emotional resonance: name the feeling → validate → advise
  - Micro‑observations: small but accurate reads; never cruel
  - Corporate satire: skewer pointless busywork (not the person)
  - Light fourth‑wall glances
- **Style & length:** 2–6 sentences per reply; one short paragraph unless the user asks for bullets. Offer 1–3 concrete next steps when coaching. Use stage directions **sparingly**.
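The Bowtie Rule is deterministic, so it can be checked mechanically during dataset cleanup or post-generation filtering. A minimal sketch (the helper below is illustrative, not part of the repo):

```python
import re

BOWTIE_RE = re.compile(r"bow\s*tie", re.IGNORECASE)

def bowtie_allowed(messages: list[dict], reply_index: int) -> bool:
    """May the assistant reply at `reply_index` include a bowtie flourish?

    Allowed only on the first assistant message of a conversation, or once the
    user has mentioned the bowtie earlier in the chat.
    """
    assistant_seen = 0
    for msg in messages[:reply_index]:
        if msg["role"] == "assistant":
            assistant_seen += 1
        elif msg["role"] == "user" and BOWTIE_RE.search(msg["content"]):
            return True  # the user brought it up, so a callback is fine
    return assistant_seen == 0  # opener only
```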

## Why Babaru?
- **Daily stickiness:** keep users coming back with humor + micro‑wins.
- **Positive value:** emotional support that’s actually useful.
- **Edge‑friendly:** run on small devices (target ≥4 GB RAM phones) with quantized weights while preserving tone and rules.

---

## What we built (project summary)
1. **Dataset design & cleanup**
   - Standardized `messages: [{role, content}]` chats.
   - Enforced alternation and assistant‑ending turns.
   - Implemented **Bowtie Rule** rigorously: opener may use bowtie; later turns scrubbed unless user mentions it.
   - Limited action stage directions to **≤1 per reply**; removed clichés/toxicity.
   - Style shaping: ensured assistant replies fall in **2–6 sentences**; raised the call‑to‑action/question rate to roughly 45% of replies (a validation sketch follows this list).
2. **Augmentation**
   - Injected light coaching prompts (micro‑wins, 60‑second plans), optional callbacks, and non‑bowtie flourishes.
3. **SFT training**
   - Base: **`meta-llama/Llama-3.2-1B-Instruct`**
   - LoRA: **r=32, α=16, dropout=0.05**, targets: attention (q/k/v/o) + MLP (gate/up/down)
   - Chat template applied with **assistant‑token masking** (loss computed only on assistant tokens).
4. **Artifacts**
   - **Adapter**: LoRA weights for flexible application.
   - **Merged**: full Transformers model with LoRA baked in.
   - **GGUF (Q8_0)**: llama.cpp‑ready quant for on‑device inference.
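Returning to the dataset cleanup in step 1, here is a rough illustration of the structural checks (field names and thresholds are assumptions about the pipeline, not its actual code):

```python
import re

def sentence_count(text: str) -> int:
    # Crude splitter; good enough for a 2-6 sentence sanity check.
    return len([s for s in re.split(r"[.!?]+", text) if s.strip()])

def validate_chat(messages: list[dict]) -> list[str]:
    """Return rule violations for one `messages: [{role, content}]` chat."""
    problems = []
    roles = [m["role"] for m in messages if m["role"] != "system"]
    if any(a == b for a, b in zip(roles, roles[1:])):
        problems.append("turns do not alternate user/assistant")
    if not roles or roles[-1] != "assistant":
        problems.append("chat does not end with an assistant turn")
    for m in messages:
        if m["role"] != "assistant":
            continue
        if not 2 <= sentence_count(m["content"]) <= 6:
            problems.append("assistant reply outside 2-6 sentences")
        if m["content"].count("*") // 2 > 1:  # *action* stage directions
            problems.append("more than one stage direction in a reply")
    return problems
```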

---

## System prompt (recommended)
Use a short, deployment‑style system prompt so behavior matches training:

```text
You are Babaru—warm, concise, playful snark (never mean). Roast the problem, not the person. Bowtie flourish only on the first reply of a new conversation.
```

> Long framework prompts are great for training variety, but at runtime prefer this concise version for minimal context cost.

---

## Repository layout (on HF)
```
StevenArtificial/Babaru-SFT-Llama-3.2-1B-Instruct/
├─ adapter/         # LoRA adapter (r=32, α=16)
├─ merged/          # Full merged Transformers model
└─ gguf/            # GGUF quantizations (Q8_0)
```
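Each folder can be pulled individually; for example, only the GGUF file is needed for an on‑device build. A minimal sketch with `huggingface_hub`, assuming the layout above:

```python
from huggingface_hub import snapshot_download

# Download only the gguf/ folder of the repo.
local_dir = snapshot_download(
    repo_id="StevenArtificial/Babaru-SFT-Llama-3.2-1B-Instruct",
    allow_patterns=["gguf/*"],
)
print(local_dir)  # contains gguf/Babaru-SFT-Llama-3.2-1B-Instruct-Q8_0.gguf
```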

---

## Usage

### A) Apply **LoRA adapter** on the base model (Transformers)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

BASE = "meta-llama/Llama-3.2-1B-Instruct"
ADAPTER_REPO = "StevenArtificial/Babaru-SFT-Llama-3.2-1B-Instruct"  # LoRA weights live in the adapter/ subfolder

device = "cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu")

tok = AutoTokenizer.from_pretrained(BASE, use_fast=True)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto")
model = PeftModel.from_pretrained(base, ADAPTER_REPO, subfolder="adapter").to(device).eval()

system = "You are Babaru—warm, concise, playful snark (never mean). Roast the problem, not the person. Bowtie flourish only on the first reply."
msgs = [
    {"role":"system","content":system},
    {"role":"user","content":"I procrastinated all day—help."}
]
enc = tok.apply_chat_template(msgs, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(device)
out = model.generate(**enc, max_new_tokens=200, do_sample=True, temperature=0.7, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
```

### B) Use the **merged** model (Transformers)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

MERGED_REPO = "StevenArtificial/Babaru-SFT-Llama-3.2-1B-Instruct"  # merged model lives in the merged/ subfolder

device = "cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu")

tok = AutoTokenizer.from_pretrained(MERGED_REPO, subfolder="merged", use_fast=True)
model = AutoModelForCausalLM.from_pretrained(MERGED_REPO, subfolder="merged", torch_dtype="auto").to(device).eval()

system = "You are Babaru—warm, concise, playful snark (never mean). Roast the problem, not the person. Bowtie flourish only on the first reply."
msgs = [
    {"role":"system","content":system},
    {"role":"user","content":"Can you give me a 60-second plan to restart?"},
]
enc = tok.apply_chat_template(msgs, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(device)
out = model.generate(**enc, max_new_tokens=200, do_sample=True, temperature=0.7, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
```

### C) Run **GGUF (Q8_0)** with llama.cpp
1) Build llama.cpp tools (once):
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
```
2) Run the quantized model (the CLI binary is `llama-cli`; older builds named it `main`):
```bash
./build/bin/llama-cli -m gguf/Babaru-SFT-Llama-3.2-1B-Instruct-Q8_0.gguf \
       -p "Hello, Babaru." -n 200 --temp 0.7 --top-p 0.9
```

### Optional: Python with llama‑cpp‑python
```bash
pip install --upgrade llama-cpp-python
```
```python
from llama_cpp import Llama

# Keep the context small for low-RAM devices; adjust n_ctx as needed.
llm = Llama(model_path="gguf/Babaru-SFT-Llama-3.2-1B-Instruct-Q8_0.gguf", n_ctx=1024)

# create_chat_completion formats the conversation with the model's chat template,
# so the Llama 3 header tokens don't need to be written by hand.
resp = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are Babaru—warm, concise, playful snark (never mean). Bowtie flourish only on the first reply."},
        {"role": "user", "content": "I doomscrolled and now I feel fried—help."},
    ],
    max_tokens=200, temperature=0.7, top_p=0.9,
)
print(resp["choices"][0]["message"]["content"].strip())
```

---

## System Requirements & Dependencies

### Desktop/Laptop (for Transformers)
- **CPU/GPU:** Any modern CPU; GPU (CUDA/Metal) recommended
- **RAM/VRAM:** 6–8 GB RAM for smooth use; VRAM ≥4 GB helps
- **Python:** 3.10+
- **Packages:**
  - `transformers >= 4.46.0`
  - `torch >= 2.2` (CUDA or Metal build)
  - `peft`, `accelerate`, `huggingface_hub`, `datasets` (for dataset ops)

### On‑device / Smartphones (goal: ≥4 GB RAM)
- Use **GGUF Q8_0** in `llama.cpp` or `llama-cpp-python`.
- **Memory guidance (approx):**
  - **Model weights:** ~1–1.2 GB for 1B @ Q8_0
  - **Runtime overhead + KV cache:** depends on context length; keep **batch=1**, **ctx ≤512–1024** for 4 GB devices (see the estimate after this list).
- **Android:** build `llama.cpp` for arm64 or use a prebuilt mobile app that supports GGUF; run with low context (e.g., `--ctx-size 512`) and 2–4 threads.
- **iOS:** use Metal builds or apps that bundle `llama.cpp`; prefer low context; background apps may reduce available RAM.
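
For a feel of where the 4 GB budget goes, the KV cache can be estimated from the model config. The figures below (16 layers, 8 KV heads, head dim 64, fp16 cache) are assumed values for Llama 3.2 1B, used only for illustration:

```python
# Back-of-the-envelope KV-cache estimate (assumed Llama 3.2 1B config values).
layers, kv_heads, head_dim, bytes_per_val = 16, 8, 64, 2  # fp16 K/V entries
ctx = 1024  # context length in tokens

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # K and V
kv_cache_mb = kv_bytes_per_token * ctx / 1e6
print(f"~{kv_bytes_per_token // 1024} KiB per token, ~{kv_cache_mb:.0f} MB at ctx={ctx}")
# Weights (~1-1.2 GB at Q8_0) + KV cache + runtime overhead stay well under 4 GB.
```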

> Tip: if 4 GB devices struggle at Q8_0, consider Q6_K or Q5_K_M for extra headroom at a small quality trade‑off.

### llama.cpp toolchain
- `cmake`, `make`, a C/C++ toolchain (Xcode/clang on macOS, MSYS2/MinGW or WSL on Windows, NDK/clang on Android)

---

## Reproduce (training outline)
- Base: `meta-llama/Llama-3.2-1B-Instruct`
- LoRA: r=32, α=16, dropout=0.05; targets: q/k/v/o + gate/up/down
- Optim: cosine schedule, warmup 10%, LR 2e‑4, epochs 4, grad accumulation for small VRAM
- Tokenization: chat template with assistant token masking; labels = assistant tokens only
- Post‑train: merge LoRA → base; convert to GGUF; quantize to **Q8_0**
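
The outline above maps onto standard `peft`/`transformers` settings. A minimal configuration sketch (dataset loading, label masking, and the trainer loop are omitted; batch size and accumulation values are placeholders):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

lora_cfg = LoraConfig(
    r=32, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(base, lora_cfg)

args = TrainingArguments(
    output_dir="babaru-sft",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.10,
    num_train_epochs=4,
    per_device_train_batch_size=1,   # placeholder; pair with accumulation
    gradient_accumulation_steps=16,  # placeholder for small-VRAM setups
)
# For the labels, copy input_ids and set every non-assistant token to -100 so
# the loss covers assistant tokens only, as described above.
```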

---

## License & Usage Notes
- The base model is under Meta’s Llama 3.x license—accept and comply with its terms.
- This repo provides SFT weights/derivatives; ensure your deployment respects both the base license and any platform/app store constraints.

---

## Roadmap
- Light micro‑pack to increase optional tactics (anchoring, micro‑observations) from roughly 1% of replies to 3–5%, without overuse
- Conversational memory adapter (per‑user preference capture)
- Smaller quant variants (Q6_K/Q5_K_M) for ultra‑low‑RAM phones

---

**Babaru’s promise:** tough love with a wink, micro‑wins over monologues, and zero bowtie spam after the opener.

**Contacts**

- Developer: **Steven Lansangan** (AI Software Engineer)
- Email: **[email protected]**