Lance-3B-AWQ-INT4

Note: "Lance" here refers to ByteDance Intelligent Creation Lab's unified multimodal model (arXiv:2605.18678), not Lance/LanceDB (the columnar data format).

MLX AWQ-INT4 quantization of bytedance-research/Lance — calibrated for VQA / image-understanding use on Apple Silicon.

| Property | Value |
| --- | --- |
| Source | bytedance-research/Lance (Lance_3B variant) |
| Quantization | AWQ INT4 (Reza2kn-style alpha-search + scale fusion), group_size=128 |
| Avg bits/weight | 4.28 |
| On-disk size (LLM only) | 3.31 GB (27% of bf16's 12.4 GB LLM) |
| On-disk size (full repo incl. VAE + ViT) | ~5.7 GB |
| License | Apache 2.0 |
| MLX min RAM | ~6 GB (fits comfortably in 8–16 GB Macs) |
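
The size figures above can be cross-checked with a little arithmetic. The sketch below assumes that affine group quantization stores one 16-bit scale and one 16-bit bias per group of 128 weights; that storage layout is an assumption, not something this card states. Under that assumption the quantized Linears cost 4.25 bits per weight, and the gap up to the reported 4.28 average would presumably come from the modules kept at bf16.

```python
# Assumed storage layout (not stated in this card): every group of weights
# shares one 16-bit scale and one 16-bit bias.
BITS = 4
GROUP_SIZE = 128
SCALE_BITS = 16  # assumption
BIAS_BITS = 16   # assumption

quantized_bpw = BITS + (SCALE_BITS + BIAS_BITS) / GROUP_SIZE

# Figures reported in the property table.
reported_bpw = 4.28
awq_gb, bf16_gb = 3.31, 12.4

size_ratio = awq_gb / bf16_gb   # on-disk ratio, AWQ-INT4 vs bf16
bpw_ratio = reported_bpw / 16   # bits-per-weight ratio, AWQ-INT4 vs bf16

print(f"bits/weight for quantized Linears: {quantized_bpw:.2f}")
print(f"on-disk ratio: {size_ratio:.1%}, bits/weight ratio: {bpw_ratio:.1%}")
```

The two ratios agree to within a fraction of a percentage point, so the reported bits/weight and on-disk sizes are consistent with each other.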

⚠️ Scope: VQA only — NOT for image generation

Use this model for: image understanding / VQA / captioning (the x2t_image task family).

Do NOT use this model for: text-to-image (t2i), image editing (image_edit), or video tasks. Both naive and calibrated quantization, in every tested configuration (4-bit and 8-bit, with and without AWQ, GEN tower quantized or kept at bf16), produced ~80% high-frequency detail loss on Lance image generation. For image generation, use the bf16 variant: mlx-community/Lance-3B-bf16.

Quality on the diagnostic VQA sweep

Validated against 6 oracle cases (tests/fixtures/results/x2t_image_sample_* in the source repo). The relevant comparison is AWQ-INT4 answer parity with the bf16 reference, since bf16 is the calibration target.

| Case | Question type | bf16 vs AWQ-INT4 parity |
| --- | --- | --- |
| 1 | yes/no reasoning over a chart | ✓ identical |
| 2 | percentage extraction (short numeric) | ✓ identical |
| 3 | license plate extraction | ✗ AWQ garbles ("Bx62bfy" → "Byfky") |
| 4 | currency amount (large number) | ✗ AWQ divergent ("1.8 million" → "198%") |
| 5 | Colosseum description (open-ended) | ✓ semantically equivalent |
| 6 | solar eclipse description (open-ended) | ~ marginal (same topic, different specifics) |

Honest summary: ~4/6 cases preserve bf16 behavior closely (cases 1, 2, and 5 match; case 6 is marginal). AWQ-INT4 is reliable for categorical and open-ended descriptive VQA, but it degrades on outputs that must be exact at the token level: alphanumeric extraction (the license plate in case 3), exact numeric values (in case 4 a currency amount came back as a percentage), and similar high-precision reasoning. 4-bit precision isn't enough to preserve these fine token-level lexical relationships.

For applications that need exact extraction of numbers, IDs, dates, or proper names, use bf16. For descriptive VQA, AWQ-INT4 is a usable win: LLM weights about 3.7× smaller (3.31 GB vs 12.4 GB) and 6.6-9.1× faster decoding on the long-form oracle cases.
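
Before relying on the quantized model for a new task, a parity check against bf16 answers along these lines may help. This is a minimal sketch: the `normalize` and `parity` helpers and the 0.8 threshold are hypothetical, not part of lance-mlx, and lexical similarity is only a rough stand-in for the "semantically equivalent" judgment used in the table.

```python
import re
from difflib import SequenceMatcher


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting does not count as a mismatch."""
    return re.sub(r"\s+", " ", answer.strip().lower())


def parity(reference: str, candidate: str, exact: bool = False, threshold: float = 0.8) -> bool:
    """Compare a bf16 answer with a quantized answer.

    exact=True suits IDs, plates, and numbers; otherwise a lexical similarity
    ratio is used (hypothetical threshold, tune it on your own data).
    """
    ref, cand = normalize(reference), normalize(candidate)
    if exact:
        return ref == cand
    return SequenceMatcher(None, ref, cand).ratio() >= threshold


# Answer pairs quoted in the parity table (cases 1, 3, and 4).
checks = {
    "yes/no": parity("Yes", "yes", exact=True),
    "license plate": parity("Bx62bfy", "Byfky", exact=True),
    "currency": parity("1.8 million", "198%", exact=True),
}
print(checks)
```

Extraction-style questions should use `exact=True`; if they fail parity on a sample of your own images, route them to the bf16 variant.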

Speed (M5 Max 128 GB, macOS 26.2, greedy decode)

| Oracle case | Output type | bf16 latency | AWQ-INT4 latency | Speedup |
| --- | --- | --- | --- | --- |
| 1 | "Yes" (1 token) | 0.6 s | 0.4 s | 1.5× |
| 2 | "43" (2 tokens) | 0.6 s | 0.3 s | 2.0× |
| 3 | License plate (short) | 1.1 s | 0.4 s | 2.8× |
| 4 | Currency description (~30 tokens) | 6.4 s | 0.7 s | 9.1× |
| 5 | Colosseum description (~80 tokens) | 12.1 s | 1.4 s | 8.6× |
| 6 | Eclipse description (~70 tokens) | 8.6 s | 1.3 s | 6.6× |
| total | | 29.4 s | 4.5 s | 6.5× wall-clock |

Long-form decoding sees the biggest speedup (6.6-9.1× on the ~30-80 token answers, versus 1.5-2.8× on the short ones), which is exactly the user-visible case for descriptive VQA.
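
The speedup column and the wall-clock total can be recomputed from the raw latencies. The snippet below only re-derives numbers already reported above; the variable names are illustrative.

```python
# (bf16 seconds, AWQ-INT4 seconds) per oracle case, copied from the speed table.
latencies = {
    1: (0.6, 0.4),
    2: (0.6, 0.3),
    3: (1.1, 0.4),
    4: (6.4, 0.7),
    5: (12.1, 1.4),
    6: (8.6, 1.3),
}

# Per-case speedup is simply bf16 latency divided by AWQ-INT4 latency.
speedups = {case: bf16 / awq for case, (bf16, awq) in latencies.items()}

total_bf16 = sum(bf16 for bf16, _ in latencies.values())
total_awq = sum(awq for _, awq in latencies.values())
total_speedup = total_bf16 / total_awq

for case, ratio in speedups.items():
    print(f"case {case}: {ratio:.1f}x")
print(f"total: {total_bf16:.1f} s vs {total_awq:.1f} s ({total_speedup:.1f}x)")
```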

Usage

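from lance_mlx.pipeline.understanding import UnderstandingPipeline
from PIL import Image

# Load the quantized LLM weights together with the ViT encoder weights.
pipe = UnderstandingPipeline.from_pretrained(
    lance_weights_dir="path/to/Lance-3B-AWQ-INT4",
    vit_safetensors="path/to/Lance-3B-AWQ-INT4/vit.safetensors",
)
# Convert to RGB so grayscale or RGBA files reach the model with three channels.
image = Image.open("photo.jpg").convert("RGB")
# Ask one question about the image; max_new_tokens caps the answer length.
answer = pipe.generate(
    image, "What is in this image?", max_new_tokens=256,
)
print(answer)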

Install lance-mlx directly from the source repo (PyPI release pending — see xocialize/lance-mlx backlog):

pip install git+https://github.com/xocialize/lance-mlx

What got quantized

  • Quantization: MLX nn.quantize mode="affine", bits=4, group_size=128
  • Calibration: the Reza2kn/lance-quant AWQ algorithm, ported to MLX. A grid search over α ∈ [0, 1] picks one exponent per fusion group, and the resulting scale is fused into the preceding RMSNorm
  • Calibration corpus: 4-prompt t2i sweep yielding 152,790 tokens of activation data per Linear (full t2i forward exercises both UND and GEN tower consumers via Lance's MoE routing)
  • Both UND and GEN towers quantized to INT4. Always-bf16 modules: time_embedder.proj_in, time_embedder.proj_out, llm2vae
  • Per-fusion-group alpha distribution: mean 0.37, median 0.35, range [0.25, 0.55]
  • qk_norms preserved (Reza2kn's PyTorch implementation drops them in its UND-only repackaging)
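
To make the alpha search and scale fusion concrete, here is a simplified NumPy sketch of the general AWQ idea for a single Linear. It is not the MLX port used for this model: the function names, the toy data, the grid size, and the scale normalization are all assumptions made for illustration.

```python
import numpy as np


def quantize_affine(w, bits=4, group_size=128):
    """Quantize-dequantize round trip with one (scale, offset) per group of input channels."""
    out_dim, in_dim = w.shape
    groups = w.reshape(out_dim, in_dim // group_size, group_size)
    lo = groups.min(axis=-1, keepdims=True)
    hi = groups.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / (2**bits - 1)
    q = np.clip(np.round((groups - lo) / scale), 0, 2**bits - 1)
    return (q * scale + lo).reshape(out_dim, in_dim)


def awq_alpha_search(x, w, n_grid=21, bits=4, group_size=128):
    """Pick the exponent alpha that minimizes output MSE after quantization.

    x: (tokens, in_dim) calibration activations; w: (out_dim, in_dim) weights.
    """
    act_mag = np.abs(x).mean(axis=0) + 1e-8  # per-input-channel activation magnitude
    reference = x @ w.T
    best = (None, np.inf, None)
    for alpha in np.linspace(0.0, 1.0, n_grid):
        s = act_mag**alpha
        s = s / np.sqrt(s.max() * s.min())  # keep scales centered around 1 (illustrative choice)
        w_q = quantize_affine(w * s, bits, group_size)
        # In a real model 1/s is fused into the preceding norm; here it is applied to x directly.
        out = (x / s) @ w_q.T
        err = float(np.mean((out - reference) ** 2))
        if err < best[1]:
            best = (float(alpha), err, s)
    return best


# Toy calibration data with a few outlier input channels (synthetic, for illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 256))
x[:, :4] *= 20.0
w = rng.normal(size=(8, 256)) * 0.02

best_alpha, best_err, best_scale = awq_alpha_search(x, w)
print(f"best alpha: {best_alpha:.2f}, output MSE: {best_err:.3e}")
```

Because alpha = 0 (no scaling) is part of the grid, the search can never do worse than plain round-to-nearest on the calibration data. Dividing the activations by `s` while multiplying the weights by `s` leaves the unquantized output unchanged, which is why the scale can be folded into the preceding norm at no runtime cost.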

Full methodology + experimental records in xocialize/lance-mlx under notes/phase5n_diagnostics/phase5c3_awq_port/.

Research-closed: why this variant is VQA-only (and why no other quant variant exists)

This model is the final shipping outcome of a quantization research effort that ran through May 2026. The effort is closed as research — we are not actively developing further quant variants for Lance. Here's the honest summary:

The image-generation gap

Phase 5c-2 (naive 8-bit), Phase 5c-3d/e (AWQ-INT4 full + AWQ-INT8 + AWQ-INT4-und) all produced ~80% high-frequency detail loss on Lance image generation. AWQ-INT4 is modestly better than naive 8-bit per-prompt (3-15 percentage points HF improvement), but no quantization recipe tested closes the gap to bf16.

Why no quant scheme closes it (Phase 5c-3h finding)

Weight-level introspection at 6 representative Linears across Lance's 36-layer stack showed:

  • AWQ math is working correctly per-Linear. It reduces per-layer output MSE by 28% on average at 8-bit and 20% at 4-bit. Weight MSE goes UP, as designed: AWQ accepts a larger weight-space error in exchange for a smaller output error on the activation-outlier channels. The algorithm is doing what the algorithm is supposed to do.
  • Per-layer gains don't compound into end-to-end image quality. Lance t2i runs 2,160 layer evaluations per image (36 layers × 30 Euler steps × 2 CFG arms, i.e. 60 full forward passes). Errors at each step feed the next step's input via the flow-matching integrator. Per-step quant improvements average out over this long path.
  • Middle layers (around layer 18) are AWQ's blind spot — their activations don't have the strong per-channel outlier pattern AWQ assumes. Middle-layer AWQ regressions partially cancel peripheral-layer gains.

The 80% HF floor is architectural, not algorithmic. k-quants from llama.cpp would face the same compounding problem. NVFP4 would face it. Custom Metal kernels would face it. No quant scheme tested or hypothesized would close this floor without changing Lance's architecture itself.
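
The evaluation count quoted above follows from the sampler structure. The toy Euler sampler below uses classifier-free guidance to show where the factors of 30 and 2 come from; the velocity function, the guidance scale of 4.0, and the guidance formula are illustrative assumptions, not Lance's actual sampler.

```python
NUM_LAYERS = 36  # layer count, from the text above
NUM_STEPS = 30   # Euler steps, from the text above

calls = {"forward": 0}


def toy_velocity(x, t, conditioned):
    """Stand-in for one full forward pass through the model (hypothetical dynamics)."""
    calls["forward"] += 1
    return -x if conditioned else -0.5 * x


def euler_cfg_sample(x0, steps=NUM_STEPS, guidance=4.0):
    """Euler integration of dx/dt = v(x, t) with two guidance arms per step."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = i * dt
        v_cond = toy_velocity(x, t, conditioned=True)     # CFG arm 1
        v_uncond = toy_velocity(x, t, conditioned=False)  # CFG arm 2
        v = v_uncond + guidance * (v_cond - v_uncond)
        x = x + dt * v  # this x is the next step's input, so any error is carried forward
    return x


final = euler_cfg_sample(1.0)
layer_evals = calls["forward"] * NUM_LAYERS
print(f"{calls['forward']} forward passes, {layer_evals} layer evaluations")
```

Each step's output becomes the next step's input, so quantization error introduced at any of these evaluations has many later steps in which to propagate.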

So when does AWQ-INT4 work?

VQA (image-understanding) doesn't have the compounding problem: each answer token comes out of a single forward pass through the stack, not a 30-step denoise + VAE decode chain. The per-layer AWQ improvements DO translate into preserved answer behavior. That's why this variant ships for VQA only and bf16 ships for t2i.

For the full research record (8 sub-phases, ~80 pages of empirical writeups) see xocialize/lance-mlx under notes/phase5n_diagnostics/phase5c3_awq_port/.

Attribution

Citation

@article{fu2026lance,
  title={Lance: Unified Multimodal Modeling by Multi-Task Synergy},
  author={Fu, Fengyi and Huang, Mengqi and Wu, Shaojin and others},
  journal={arXiv preprint arXiv:2605.18678},
  year={2026}
}