Lance-3B-AWQ-INT4

Note: "Lance" here refers to ByteDance Intelligent Creation Lab's unified multimodal model (arXiv:2605.18678), not Lance/LanceDB (the columnar data format).

MLX AWQ-INT4 quantization of bytedance-research/Lance — calibrated for VQA / image-understanding use on Apple Silicon.

Property	Value
Source	bytedance-research/Lance — Lance_3B variant
Quantization	AWQ INT4 (Reza2kn-style alpha-search + scale fusion), group_size=128
Avg bits/weight	4.28
On-disk size (LLM only)	3.31 GB (27% of bf16's 12.4 GB LLM)
On-disk size (full repo incl. VAE + ViT)	~5.7 GB
License	Apache 2.0
MLX min RAM	~6 GB (fits comfortably in 8–16 GB Macs)
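
The 4.28 average sits slightly above 4 bits because group-wise affine quantization stores metadata alongside the packed weights. A back-of-envelope sketch, assuming one 16-bit scale and one 16-bit bias per group of 128 weights (the helper name is illustrative, and the card does not itemize where the remaining overhead comes from):

```python
def effective_bits(bits: int, group_size: int, meta_bits_per_group: int = 32) -> float:
    """Bits stored per weight once the per-group scale and bias are amortized."""
    return bits + meta_bits_per_group / group_size

# Assumption: 16-bit scale + 16-bit bias per group, i.e. 32 metadata bits.
quantized_layer_bits = effective_bits(bits=4, group_size=128)
print(quantized_layer_bits)          # 4.25

# Size figures quoted in the table above (GB).
awq_gb, bf16_gb = 3.31, 12.4
print(round(awq_gb / bf16_gb, 2))    # 0.27
```

The gap between 4.25 and the reported 4.28 would be consistent with a few modules staying in bf16, but that attribution is a guess rather than something the table states.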

⚠️ Scope: VQA only — NOT for image generation

Use this model for: image understanding / VQA / captioning (the x2t_image task family).

Do NOT use this model for: text-to-image (t2i), image editing (image_edit), or video tasks. Every quantization recipe tested (naive and calibrated; 4-bit and 8-bit; with and without AWQ; GEN tower quantized or kept at bf16) produced ~80% high-frequency detail loss on Lance image generation. For image generation, use the bf16 variant: mlx-community/Lance-3B-bf16.

Quality on the diagnostic VQA sweep

Validated against 6 oracle cases (tests/fixtures/results/x2t_image_sample_* in the source repo). The relevant comparison is AWQ-INT4 answer parity with the bf16 reference, since bf16 is the calibration target.

Case	Question type	bf16 vs AWQ-INT4 parity
1	yes/no reasoning over a chart	✓ identical
2	percentage extraction (short numeric)	✓ identical
3	license plate extraction	✗ AWQ garbles ("Bx62bfy" → "Byfky")
4	currency amount (large number)	✗ AWQ divergent ("1.8 million" → "198%")
5	Colosseum description (open-ended)	✓ semantically equivalent
6	solar eclipse description (open-ended)	~ marginal (same topic, different specifics)

Honest summary: roughly 4 of 6 cases preserve bf16 behavior: three match outright (cases 1, 2, 5), one is marginal (case 6), and two diverge (cases 3, 4). AWQ-INT4 is reliable for categorical and open-ended descriptive VQA, but degrades where every output token must be exact: alphanumeric extraction (license plates) and exact numeric values with units (case 4 turned "1.8 million" into "198%"). Short numeric answers can still survive, as case 2 shows. At 4 bits, the weights do not retain enough precision to keep fine token-level distinctions intact.

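For applications that need exact extraction of numbers / IDs / dates / proper names, use bf16. For descriptive VQA, AWQ-INT4 is a usable win: the LLM weights are about 3.7× smaller (3.31 GB vs 12.4 GB) and long-form answers decode 6.6-9.1× faster in the speed table below.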
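
The card does not say how parity was scored. For the short extractive cases, one simple option is normalized exact match; the sketch below assumes that rule (the function names are illustrative) and applies it to the two divergent answers quoted in the table. Open-ended cases such as 5 and 6 need a human or semantic judgment instead.

```python
def normalize(answer: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing sentence punctuation."""
    return " ".join(answer.lower().split()).rstrip(".!?")

def exact_parity(reference: str, candidate: str) -> bool:
    """True when two answers are identical after normalization."""
    return normalize(reference) == normalize(candidate)

# (bf16 reference, AWQ-INT4 output) pairs quoted in the parity table.
pairs = {
    "license plate": ("Bx62bfy", "Byfky"),
    "currency amount": ("1.8 million", "198%"),
}
for name, (ref, cand) in pairs.items():
    print(name, exact_parity(ref, cand))
```

Both pairs fail the check, which matches the ✗ marks for cases 3 and 4.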

Speed (M5 Max 128 GB, macOS 26.2, greedy decode)

Oracle case	Output type	bf16 latency	AWQ-INT4 latency	Speedup
1	"Yes" (1 token)	0.6 s	0.4 s	1.5×
2	"43" (2 tokens)	0.6 s	0.3 s	2.0×
3	License plate (short)	1.1 s	0.4 s	2.8×
4	Currency description (~30 tokens)	6.4 s	0.7 s	9.1×
5	Colosseum description (~80 tokens)	12.1 s	1.4 s	8.6×
6	Eclipse description (~70 tokens)	8.6 s	1.3 s	6.6×
total	—	29.4 s	4.5 s	6.5× wall-clock

Long-form decoding sees the biggest speedup — exactly the user-visible case for descriptive VQA.
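
The speedup column can be re-derived from the two latency columns. The sketch below does only that arithmetic, using the numbers copied from the table:

```python
# (bf16 seconds, AWQ-INT4 seconds) per oracle case, copied from the table above.
latencies = {
    1: (0.6, 0.4),
    2: (0.6, 0.3),
    3: (1.1, 0.4),
    4: (6.4, 0.7),
    5: (12.1, 1.4),
    6: (8.6, 1.3),
}

for case, (bf16_s, awq_s) in latencies.items():
    print(f"case {case}: {bf16_s / awq_s:.1f}x")

# The wall-clock figure is a ratio of totals, not a mean of per-case ratios.
total_bf16 = sum(b for b, _ in latencies.values())
total_awq = sum(a for _, a in latencies.values())
total_speedup = total_bf16 / total_awq
print(f"total: {total_bf16:.1f} s vs {total_awq:.1f} s -> {total_speedup:.1f}x")
```

Because the total is time-weighted, the three long-form cases dominate it; the short answers in cases 1-3 contribute little either way.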

Usage

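from lance_mlx.pipeline.understanding import UnderstandingPipeline
from PIL import Image

# Point both arguments at a local copy of this repo.
pipe = UnderstandingPipeline.from_pretrained(
    lance_weights_dir="path/to/Lance-3B-AWQ-INT4",
    vit_safetensors="path/to/Lance-3B-AWQ-INT4/vit.safetensors",
)
# Convert to RGB so grayscale, palette, and RGBA files are handled uniformly.
image = Image.open("photo.jpg").convert("RGB")
# Ask a question about the image; max_new_tokens caps the answer length.
answer = pipe.generate(
    image, "What is in this image?", max_new_tokens=256,
)
print(answer)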

Install lance-mlx directly from the source repo (PyPI release pending — see xocialize/lance-mlx backlog):

pip install git+https://github.com/xocialize/lance-mlx
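
The usage snippet expects a local copy of the weights. A minimal sketch of fetching them, assuming the huggingface_hub package is installed and that the Hub repo id is mlx-community/Lance-3B-AWQ-INT4; the two helper names are illustrative and not part of lance-mlx:

```python
from pathlib import Path

REPO_ID = "mlx-community/Lance-3B-AWQ-INT4"  # assumed Hub repo id for this card

def download_repo(repo_id: str = REPO_ID) -> Path:
    """Fetch the repo into the local Hugging Face cache and return its path."""
    from huggingface_hub import snapshot_download  # lazy import; needs network
    return Path(snapshot_download(repo_id=repo_id))

def pipeline_kwargs(local_dir: Path) -> dict:
    """Build the two path arguments used by the usage snippet above."""
    return {
        "lance_weights_dir": str(local_dir),
        "vit_safetensors": str(local_dir / "vit.safetensors"),
    }

print(pipeline_kwargs(Path("path/to/Lance-3B-AWQ-INT4")))
```

In practice the result of download_repo() would be passed to pipeline_kwargs(), and the resulting dict unpacked into UnderstandingPipeline.from_pretrained(**kwargs).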

What got quantized

Quantization: MLX nn.quantize mode="affine", bits=4, group_size=128
Calibration: Reza2kn/lance-quant AWQ algorithm ported to MLX. Alpha is grid-searched over [0, 1] per fusion group, and the resulting scale is fused into the preceding RMSNorm
Calibration corpus: 4-prompt t2i sweep yielding 152,790 tokens of activation data per Linear (full t2i forward exercises both UND and GEN tower consumers via Lance's MoE routing)
Both UND and GEN towers quantized to INT4. Always-bf16 modules: time_embedder.proj_in, time_embedder.proj_out, llm2vae
Per-fusion-group alpha distribution: mean 0.37, median 0.35, range [0.25, 0.55]
qk_norms are preserved (Reza2kn's PyTorch UND-only repackaging drops them)
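
The alpha search described above can be illustrated with a toy, pure-Python version of the generic AWQ recipe. This is a sketch under stated assumptions, not the Reza2kn/lance-quant or lance-mlx code: the data is synthetic, all function names are hypothetical, and the inverse scale is folded back into the weights here, whereas the real port fuses it into the preceding RMSNorm.

```python
import random

def quantize_group(values, bits=4):
    """Affine round-trip (quantize then dequantize) of one group of weights."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant groups
    return [round((v - lo) / scale) * scale + lo for v in values]

def quantize_row(row, group_size, bits=4):
    """Quantize one weight row group by group."""
    out = []
    for start in range(0, len(row), group_size):
        out.extend(quantize_group(row[start:start + group_size], bits))
    return out

def output_mse(X, W, W_hat):
    """Mean squared error between X @ W.T and X @ W_hat.T."""
    err, count = 0.0, 0
    for x in X:
        for w, w_hat in zip(W, W_hat):
            diff = sum(xi * (wi - whi) for xi, wi, whi in zip(x, w, w_hat))
            err += diff * diff
            count += 1
    return err / count

def awq_search(X, W, group_size, bits=4, grid=11):
    """Grid-search alpha in [0, 1] for the lowest layer-output MSE."""
    n_in = len(W[0])
    act_mag = [sum(abs(x[j]) for x in X) / len(X) for j in range(n_in)]
    best = None
    for step in range(grid):
        alpha = step / (grid - 1)
        s = [max(m, 1e-8) ** alpha for m in act_mag]
        # Quantize W * s, then divide by s so W_hat approximates W again.
        W_q = [quantize_row([w * sj for w, sj in zip(row, s)], group_size, bits)
               for row in W]
        W_hat = [[w / sj for w, sj in zip(row, s)] for row in W_q]
        mse = output_mse(X, W, W_hat)
        if best is None or mse < best[1]:
            best = (alpha, mse, s)
    return best

random.seed(0)
n_in, n_out, n_tokens = 16, 4, 64
# Hypothetical data: two input channels carry much larger activations.
channel_gain = [20.0 if j in (3, 11) else 1.0 for j in range(n_in)]
X = [[random.gauss(0, 1) * g for g in channel_gain] for _ in range(n_tokens)]
W = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]

alpha, mse, _ = awq_search(X, W, group_size=8)
baseline = output_mse(X, W, [quantize_row(r, 8) for r in W])
print(f"best alpha={alpha:.1f}  mse={mse:.4f}  plain-quant mse={baseline:.4f}")
```

Since alpha = 0 reduces to plain quantization, the searched result can never be worse than the baseline on the calibration data; how much better it gets depends on how pronounced the outlier channels are.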

Full methodology + experimental records in xocialize/lance-mlx under notes/phase5n_diagnostics/phase5c3_awq_port/.

Research-closed: why this variant is VQA-only (and why no other quant variant exists)

This model is the final shipping outcome of a quantization research effort that ran through May 2026. The effort is closed as research — we are not actively developing further quant variants for Lance. Here's the honest summary:

The image-generation gap

Phase 5c-2 (naive 8-bit), Phase 5c-3d/e (AWQ-INT4 full + AWQ-INT8 + AWQ-INT4-und) all produced ~80% high-frequency detail loss on Lance image generation. AWQ-INT4 is modestly better than naive 8-bit per-prompt (3-15 percentage points HF improvement), but no quantization recipe tested closes the gap to bf16.
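
This card does not define how high-frequency (HF) detail loss is measured. As an illustration only, one plausible proxy is the share of Laplacian energy missing relative to the bf16 render. The sketch below assumes that proxy, with a synthetic checkerboard standing in for a real image and a box blur standing in for quantization damage; none of it is taken from the lance-mlx diagnostics.

```python
def laplacian_energy(img):
    """Sum of squared 4-neighbour Laplacian responses over interior pixels."""
    h, w = len(img), len(img[0])
    total = 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1]
                   - 4 * img[y][x])
            total += lap * lap
    return total

def hf_loss(reference, candidate):
    """Fraction of the reference's high-frequency energy missing from candidate."""
    return 1.0 - laplacian_energy(candidate) / laplacian_energy(reference)

def box_blur(img):
    """3x3 mean filter with edge clamping, used here to simulate detail loss."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            row.append(sum(vals) / 9.0)
        out.append(row)
    return out

# Synthetic checkerboard as a stand-in for a detailed reference render.
reference = [[float((x + y) % 2) for x in range(16)] for y in range(16)]
degraded = box_blur(reference)
print(f"HF loss: {hf_loss(reference, degraded):.0%}")
```

A metric of this shape reports 0% for an identical image and approaches 100% as fine detail is smoothed away, which is the sense in which an ~80% loss should be read.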

Why no quant scheme closes it (Phase 5c-3h finding)

Weight-level introspection at 6 representative Linears across Lance's 36-layer stack showed:

AWQ math is working correctly per-Linear. It reduces per-layer output MSE by 28% on average at 8-bit and 20% at 4-bit. Weight MSE goes up, as designed: AWQ accepts more error on the weights overall in exchange for less output error on the channels with outlier activations. The algorithm is doing what it is supposed to do.
Per-layer gains don't compound into end-to-end image quality. Lance t2i runs 2,160 layer evaluations per image (36 layers × 30 Euler steps × 2 CFG arms, i.e. 60 full forward passes). Errors at each step feed the next step's input via the flow-matching integrator. Per-step quant improvements average out over this long path.
Middle layers (around layer 18) are AWQ's blind spot — their activations don't have the strong per-channel outlier pattern AWQ assumes. Middle-layer AWQ regressions partially cancel peripheral-layer gains.

The 80% HF floor is architectural, not algorithmic. k-quants from llama.cpp would face the same compounding problem. NVFP4 would face it. Custom Metal kernels would face it. No quant scheme tested or hypothesized would close this floor without changing Lance's architecture itself.
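
The 36 × 30 × 2 count above follows directly from the sampling loop structure. A toy sketch, assuming a standard classifier-free-guidance Euler sampler; the velocity function and the guidance scale of 3.0 are placeholders, not Lance's actual model or settings:

```python
NUM_LAYERS = 36      # layers in the stack (from the text above)
EULER_STEPS = 30     # flow-matching integration steps
CFG_ARMS = 2         # conditional + unconditional pass per step

def toy_velocity(x: float, conditional: bool) -> tuple[float, int]:
    """Stand-in for one full forward pass; returns (velocity, layers run)."""
    v = -x if conditional else -0.5 * x   # arbitrary toy dynamics
    return v, NUM_LAYERS

def sample(x0: float, cfg_scale: float = 3.0) -> tuple[float, int]:
    """Run the toy sampler and count how many layer evaluations it performs."""
    x, layer_evals = x0, 0
    dt = 1.0 / EULER_STEPS
    for _ in range(EULER_STEPS):
        v_cond, n1 = toy_velocity(x, conditional=True)
        v_uncond, n2 = toy_velocity(x, conditional=False)
        layer_evals += n1 + n2
        # Classifier-free guidance, then one explicit Euler update.
        v = v_uncond + cfg_scale * (v_cond - v_uncond)
        x = x + dt * v   # this x is the input to every layer at the next step
    return x, layer_evals

_, layer_evals = sample(1.0)
print(layer_evals)   # 2160
```

The point of the loop is the last line of the body: whatever error a quantized layer adds at one step is baked into the state that all later steps consume.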

So when does AWQ-INT4 work?

VQA (image understanding) doesn't have the compounding problem: each answer token comes from one forward pass through the stack, not from a 30-step denoise + VAE decode chain. The per-layer AWQ improvements DO translate into preserved answer behavior. That's why this variant ships for VQA only and bf16 ships for t2i.

For the full research record (8 sub-phases, ~80 pages of empirical writeups) see xocialize/lance-mlx under notes/phase5n_diagnostics/phase5c3_awq_port/.

Attribution

Upstream weights: bytedance-research/Lance (Apache 2.0)
Wan2.2 VAE: Alibaba Wan-AI team (Apache 2.0)
Qwen2.5-VL ViT (vision encoder init): Alibaba Qwen team (Apache 2.0)
AWQ algorithm: Reza2kn/lance-quant (alpha-search + scale fusion recipe ported to MLX)
MLX conversion + AWQ port: xocialize/lance-mlx
Substrate packages: Blaizzy/mlx-vlm

Citation

@article{fu2026lance,
  title={Lance: Unified Multimodal Modeling by Multi-Task Synergy},
  author={Fu, Fengyi and Huang, Mengqi and Wu, Shaojin and others},
  journal={arXiv preprint arXiv:2605.18678},
  year={2026}
}

