Lance-3B-AWQ-INT4
Note: "Lance" here refers to ByteDance Intelligent Creation Lab's unified multimodal model (arXiv:2605.18678), not Lance/LanceDB (the columnar data format).
MLX AWQ-INT4 quantization of bytedance-research/Lance — calibrated for VQA / image-understanding use on Apple Silicon.
| Property | Value |
|---|---|
| Source | bytedance-research/Lance — Lance_3B variant |
| Quantization | AWQ INT4 (Reza2kn-style alpha-search + scale fusion), group_size=128 |
| Avg bits/weight | 4.28 |
| On-disk size (LLM only) | 3.31 GB (27% of bf16's 12.4 GB LLM) |
| On-disk size (full repo incl. VAE + ViT) | ~5.7 GB |
| License | Apache 2.0 |
| MLX min RAM | ~6 GB (fits comfortably in 8–16 GB Macs) |
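The bits-per-weight and size figures in the table can be sanity-checked with a few lines of arithmetic. This is a minimal sketch that assumes each group of 128 weights carries one 16-bit scale and one 16-bit bias (the storage widths are an assumption, not stated in this card); the helper name is hypothetical.

```python
def affine_bits_per_weight(bits, group_size, scale_bits=16, bias_bits=16):
    """Effective bits per weight for group-wise affine quantization.

    Assumes one scale and one bias per group (widths are assumptions).
    """
    return bits + (scale_bits + bias_bits) / group_size


quantized_layers = affine_bits_per_weight(bits=4, group_size=128)
size_ratio = 3.31 / 12.4  # AWQ-INT4 LLM size over bf16 LLM size, from the table

print(f"bits/weight in quantized layers: {quantized_layers}")
print(f"LLM size relative to bf16: {size_ratio:.1%}")
```

Under these assumptions the quantized layers alone come to 4.25 bits per weight. The reported 4.28 average is slightly higher, which would be consistent with the few modules kept at bf16, though that attribution is an inference rather than something measured here.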
⚠️ Scope: VQA only — NOT for image generation
Use this model for: image understanding / VQA / captioning (the x2t_image task family).
Do NOT use this model for: text-to-image (t2i), image editing (image_edit), or video tasks. Both naive and calibrated quantization, in every tested configuration (4-bit and 8-bit, with and without AWQ, GEN tower quantized or preserved at bf16), produce ~80% high-frequency detail loss on Lance image generation. For image generation, use the bf16 variant: mlx-community/Lance-3B-bf16.
Quality on the diagnostic VQA sweep
Validated against 6 oracle cases (tests/fixtures/results/x2t_image_sample_* in the source repo). The relevant comparison is AWQ-INT4 answer parity with the bf16 reference, since bf16 is the calibration target.
| Case | Question type | bf16 vs AWQ-INT4 parity |
|---|---|---|
| 1 | yes/no reasoning over a chart | ✓ identical |
| 2 | percentage extraction (short numeric) | ✓ identical |
| 3 | license plate extraction | ✗ AWQ garbles ("Bx62bfy" → "Byfky") |
| 4 | currency amount (large number) | ✗ AWQ divergent ("1.8 million" → "198%") |
| 5 | Colosseum description (open-ended) | ✓ semantically equivalent |
| 6 | solar eclipse description (open-ended) | ~ marginal (same topic, different specifics) |
Honest summary: about 4 of 6 cases preserve bf16 behavior closely (three match, one is marginal) and two diverge. AWQ-INT4 is reliable for categorical and open-ended descriptive VQA, but degrades on precision-required outputs: alphanumeric extraction (license plates), exact numeric values (case 4 turned "1.8 million" into "198%"), and similar high-precision token-level reasoning. 4-bit precision is not enough to preserve fine token-level lexical relationships.
For applications that need exact extraction of numbers / IDs / dates / proper names, use bf16. For descriptive VQA, AWQ-INT4 is a usable win: roughly 4× smaller LLM weights (3.31 GB vs 12.4 GB) and 6-9× faster on long-form answers.
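For the precision-sensitive cases, parity with bf16 can be checked mechanically. The sketch below uses a normalized exact-match comparison on the short answers quoted in the tables above; the `normalize` and `exact_parity` helpers are hypothetical and are not part of lance-mlx. Open-ended descriptions (cases 5 and 6) need a semantic comparison instead and are left out.

```python
def normalize(answer: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing periods."""
    return " ".join(answer.lower().split()).rstrip(".")


def exact_parity(bf16_answer: str, awq_answer: str) -> bool:
    """True when both answers are identical after normalization."""
    return normalize(bf16_answer) == normalize(awq_answer)


# (case, bf16 answer, AWQ-INT4 answer), taken from the tables in this card.
short_answer_cases = [
    (1, "Yes", "Yes"),
    (2, "43", "43"),
    (3, "Bx62bfy", "Byfky"),
    (4, "1.8 million", "198%"),
]

parity = [exact_parity(ref, awq) for _, ref, awq in short_answer_cases]
for (case, ref, awq), ok in zip(short_answer_cases, parity):
    print(f"case {case}: {'match' if ok else 'MISMATCH'} ({ref!r} vs {awq!r})")
```

A check like this is only meaningful for extraction-style questions, which is where the table shows AWQ-INT4 diverging.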
Speed (M5 Max 128 GB, macOS 26.2, greedy decode)
| Oracle case | Output type | bf16 latency | AWQ-INT4 latency | Speedup |
|---|---|---|---|---|
| 1 | "Yes" (1 token) | 0.6 s | 0.4 s | 1.5× |
| 2 | "43" (2 tokens) | 0.6 s | 0.3 s | 2.0× |
| 3 | License plate (short) | 1.1 s | 0.4 s | 2.8× |
| 4 | Currency description (~30 tokens) | 6.4 s | 0.7 s | 9.1× |
| 5 | Colosseum description (~80 tokens) | 12.1 s | 1.4 s | 8.6× |
| 6 | Eclipse description (~70 tokens) | 8.6 s | 1.3 s | 6.6× |
| total | — | 29.4 s | 4.5 s | 6.5× wall-clock |
Long-form decoding sees the biggest speedup — exactly the user-visible case for descriptive VQA.
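The speedup column can be reproduced from the latency columns. The short sketch below recomputes it from the table values; note that the 6.5× figure is the ratio of total wall-clock times, not the mean of the per-case speedups.

```python
# (bf16 seconds, AWQ-INT4 seconds) per oracle case, copied from the table.
latencies = {
    1: (0.6, 0.4),
    2: (0.6, 0.3),
    3: (1.1, 0.4),
    4: (6.4, 0.7),
    5: (12.1, 1.4),
    6: (8.6, 1.3),
}

speedups = {case: bf16 / awq for case, (bf16, awq) in latencies.items()}
total_bf16 = sum(bf16 for bf16, _ in latencies.values())
total_awq = sum(awq for _, awq in latencies.values())
overall_speedup = total_bf16 / total_awq

for case, factor in speedups.items():
    print(f"case {case}: {factor:.1f}x")
print(f"total: {total_bf16:.1f} s vs {total_awq:.1f} s -> {overall_speedup:.1f}x")
```

Because the long answers dominate total time, the wall-clock ratio sits close to the long-form speedups rather than the 1.5-2.8× seen on very short outputs.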
Usage
```python
from lance_mlx.pipeline.understanding import UnderstandingPipeline
from PIL import Image

pipe = UnderstandingPipeline.from_pretrained(
    lance_weights_dir="path/to/Lance-3B-AWQ-INT4",
    vit_safetensors="path/to/Lance-3B-AWQ-INT4/vit.safetensors",
)

image = Image.open("photo.jpg").convert("RGB")
answer = pipe.generate(
    image, "What is in this image?", max_new_tokens=256,
)
print(answer)
```
Install lance-mlx directly from the source repo (PyPI release pending; see the xocialize/lance-mlx backlog):

```bash
pip install git+https://github.com/xocialize/lance-mlx
```
What got quantized
- Quantization: MLX nn.quantize with mode="affine", bits=4, group_size=128
- Calibration: Reza2kn/lance-quant AWQ algorithm ported to MLX. Alpha-grid search ∈ [0, 1] per fusion group, scale fused into preceding RMSNorm
- Calibration corpus: 4-prompt t2i sweep yielding 152,790 tokens of activation data per Linear (full t2i forward exercises both UND and GEN tower consumers via Lance's MoE routing)
- Both UND and GEN towers quantized to INT4. Always-bf16 modules: time_embedder.proj_in, time_embedder.proj_out, llm2vae
- Per-fusion-group alpha distribution: mean 0.37, median 0.35, range [0.25, 0.55]
- qk_norms preserved (vs Reza2kn's PyTorch which drops them in their UND-only repackaging)
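The alpha search and scale fusion described in the list above can be illustrated with a small NumPy sketch. This is not the Reza2kn/lance-quant or lance-mlx code: the function names, the toy shapes, the 21-point grid, the scale normalization, and the use of mean absolute activation as the channel statistic are all assumptions chosen for illustration. Dividing the activations by the scale stands in for folding the inverse scale into the preceding RMSNorm weight.

```python
import numpy as np


def quantize_affine(w, bits=4, group_size=128):
    """Fake-quantize w (out, in) with a per-group affine scale and offset."""
    out_dim, in_dim = w.shape
    assert in_dim % group_size == 0, "in_dim must be a multiple of group_size"
    groups = w.reshape(out_dim, in_dim // group_size, group_size)
    w_min = groups.min(axis=-1, keepdims=True)
    w_max = groups.max(axis=-1, keepdims=True)
    levels = 2 ** bits - 1
    scale = np.maximum(w_max - w_min, 1e-8) / levels
    q = np.clip(np.round((groups - w_min) / scale), 0, levels)
    return (q * scale + w_min).reshape(out_dim, in_dim)


def awq_alpha_search(w, x, n_grid=21, bits=4, group_size=128):
    """Pick alpha in [0, 1] that minimizes output MSE after quantization."""
    reference = x @ w.T
    act_mag = np.abs(x).mean(axis=0) + 1e-8  # per input channel statistic
    best_alpha, best_mse = None, np.inf
    for alpha in np.linspace(0.0, 1.0, n_grid):
        s = act_mag ** alpha
        s = s / np.sqrt(s.max() * s.min())  # keep scales centered (assumption)
        w_q = quantize_affine(w * s, bits, group_size)
        # x / s models the inverse scale fused into the preceding norm.
        out = (x / s) @ w_q.T
        mse = float(np.mean((out - reference) ** 2))
        if mse < best_mse:
            best_alpha, best_mse = float(alpha), mse
    return best_alpha, best_mse


rng = np.random.default_rng(0)
x = rng.normal(size=(256, 128))
x[:, :4] *= 20.0  # a few outlier activation channels, the case AWQ targets
w = rng.normal(size=(64, 128))

naive_mse = float(np.mean((x @ quantize_affine(w).T - x @ w.T) ** 2))
best_alpha, best_mse = awq_alpha_search(w, x)
print(f"naive 4-bit output MSE: {naive_mse:.4f}")
print(f"best alpha: {best_alpha:.2f}, output MSE: {best_mse:.4f}")
```

Since alpha = 0 leaves the weights unscaled, the search can never do worse than naive quantization on the calibration data. How much it helps depends on how pronounced the outlier channels are.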
Full methodology + experimental records in xocialize/lance-mlx under notes/phase5n_diagnostics/phase5c3_awq_port/.
Research-closed: why this variant is VQA-only (and why no other quant variant exists)
This model is the final shipping outcome of a quantization research effort that ran through May 2026. The effort is closed as research — we are not actively developing further quant variants for Lance. Here's the honest summary:
The image-generation gap
Phase 5c-2 (naive 8-bit), Phase 5c-3d/e (AWQ-INT4 full + AWQ-INT8 + AWQ-INT4-und) all produced ~80% high-frequency detail loss on Lance image generation. AWQ-INT4 is modestly better than naive 8-bit per-prompt (3-15 percentage points HF improvement), but no quantization recipe tested closes the gap to bf16.
Why no quant scheme closes it (Phase 5c-3h finding)
Weight-level introspection at 6 representative Linears across Lance's 36-layer stack showed:
- AWQ math is working correctly per-Linear. It reduces per-layer output MSE by 28% on average at 8-bit and 20% at 4-bit. Weight MSE goes UP (as designed — AWQ trades uniform weight error for outlier-channel output error). The algorithm is doing what the algorithm is supposed to do.
- Per-layer gains don't compound into end-to-end image quality. Lance t2i runs 2,160 layer evaluations per image (36 layers × 30 Euler steps × 2 CFG arms, i.e. 60 full forward passes). Errors at each step feed the next step's input via the flow-matching integrator. Per-step quant improvements average out over this long path.
- Middle layers (around layer 18) are AWQ's blind spot — their activations don't have the strong per-channel outlier pattern AWQ assumes. Middle-layer AWQ regressions partially cancel peripheral-layer gains.
The 80% HF floor is architectural, not algorithmic. k-quants from llama.cpp would face the same compounding problem. NVFP4 would face it. Custom Metal kernels would face it. No quant scheme tested or hypothesized would close this floor without changing Lance's architecture itself.
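The length of the t2i path quoted above follows directly from the sampler settings. A quick check of the arithmetic, using only the numbers given in this section (the variable names are illustrative):

```python
LAYERS = 36        # depth of Lance's layer stack
EULER_STEPS = 30   # flow-matching integration steps per image
CFG_ARMS = 2       # two classifier-free guidance arms per step

t2i_forward_passes = EULER_STEPS * CFG_ARMS
t2i_layer_evals = LAYERS * t2i_forward_passes

print(f"forward passes per image: {t2i_forward_passes}")
print(f"quantized layer evaluations per image: {t2i_layer_evals}")
```

Every one of those evaluations runs through quantized weights, and each Euler step consumes the previous step's output, which is the compounding path the section describes.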
So when does AWQ-INT4 work?
VQA (image understanding) doesn't have the compounding problem: it is a single forward pass producing a text answer, not a 30-step denoise + VAE decode chain. The per-layer AWQ improvements DO translate into preserved answer behavior. That's why this variant ships for VQA only and bf16 ships for t2i.
For the full research record (8 sub-phases, ~80 pages of empirical writeups) see xocialize/lance-mlx under notes/phase5n_diagnostics/phase5c3_awq_port/.
Attribution
- Upstream weights: bytedance-research/Lance (Apache 2.0)
- Wan2.2 VAE: Alibaba Wan-AI team (Apache 2.0)
- Qwen2.5-VL ViT (vision encoder init): Alibaba Qwen team (Apache 2.0)
- AWQ algorithm: Reza2kn/lance-quant (alpha-search + scale fusion recipe ported to MLX)
- MLX conversion + AWQ port: xocialize/lance-mlx
- Substrate packages: Blaizzy/mlx-vlm
Citation
```bibtex
@article{fu2026lance,
  title={Lance: Unified Multimodal Modeling by Multi-Task Synergy},
  author={Fu, Fengyi and Huang, Mengqi and Wu, Shaojin and others},
  journal={arXiv preprint arXiv:2605.18678},
  year={2026}
}
```