🚀 [Project] Embed + Rerank API for Apple Silicon (MLX/LM Studio) — feedback welcome!

#20 · by joonsoo-me (MLX Community org) · edited 3 days ago


Hi everyone! I’ve been running a CUDA + Ollama homelab for a while, but rising power and hardware costs pushed me to migrate to a Mac setup. Along the way I hit the usual “context length vs. cost” trade-offs, so I built a VectorDB + n8n proxy that intercepts requests before they hit the LLM. Recently I discovered LM Studio’s MLX backend and started experimenting with MLX-accelerated pipelines on Apple Silicon — which led me here. LM Studio’s MLX engine makes on-device inference very efficient on Macs, and in my tests it’s noticeably snappier than my previous setup. (LM Studio)

While exploring models, I came across the new Qwen3 Embedding/Reranking series. These models are designed specifically for retrieval tasks (embeddings + cross-encoder reranking) and come in multiple sizes, which felt like a great fit for a compact homelab stack. (Qwen, Hugging Face, GitHub, Milvus)

What I built

embed-rerank — a small FastAPI service that exposes simple REST endpoints for:

  • Embeddings (Qwen3 Embedding series)
  • Reranking (Qwen3 Reranker series)

It's designed to run on macOS with MLX (via LM Studio), with Metal/MPS as a fallback, so everything stays local on Apple Silicon. (ml-explore.github.io, LM Studio)

Repo: https://github.com/joonsoome/embed-rerank
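If you want a feel for wiring the service into an existing stack, here's a minimal client sketch using only the Python standard library. The base URL, endpoint paths (`/embed`, `/rerank`), and payload field names are assumptions for illustration — check the repo's README for the actual API shape.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed default; adjust to your deployment


def post_json(path: str, payload: dict) -> dict:
    """POST a JSON payload to the service and decode the JSON response."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Hypothetical request bodies -- field names are illustrative, not confirmed.
embed_payload = {"texts": ["what is MLX?", "Apple Silicon inference"]}
rerank_payload = {
    "query": "on-device embeddings on macOS",
    "documents": ["MLX is Apple's array framework.", "Unrelated text."],
}

# With the service running locally, the calls would look like:
#   embeddings = post_json("/embed", embed_payload)
#   ranking = post_json("/rerank", rerank_payload)
```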

Why this might be useful

  • Homelab-friendly: simple REST API for RAG stacks; drop-in with your Vector DB + n8n (or other orchestrators).
  • Apple Silicon first: leverages the MLX runtime path on Macs; no external GPU required. (GitHub, opensource.apple.com)
  • Two-stage retrieval: dense retrieval + cross-encoder rerank with the Qwen3 series. (Qwen, Milvus)
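For anyone new to the pattern, the two-stage idea can be sketched in a few lines: a cheap dense similarity pass narrows the corpus to k candidates, and only those are re-scored by the expensive cross-encoder. The cosine scoring and the fixed score table below are toy stand-ins — in the actual service the embeddings would come from a Qwen3 Embedding model and the rerank scores from the Qwen3 Reranker.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def two_stage_retrieve(query_vec, doc_vecs, docs, rerank_score, k=3):
    """Stage 1: rank all docs by cosine similarity, keep the top k.
    Stage 2: re-order only those k candidates with the (costlier) reranker."""
    candidates = sorted(
        range(len(docs)), key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True
    )[:k]
    reranked = sorted(candidates, key=lambda i: rerank_score(docs[i]), reverse=True)
    return [docs[i] for i in reranked]


# Toy demo with hand-made 2-d "embeddings".
docs = ["MLX runs on Apple Silicon.", "Bananas are yellow.", "MPS is Metal's compute backend."]
doc_vecs = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
query_vec = [1.0, 0.0]
# Stand-in cross-encoder: a fixed score table (a real reranker scores query-doc pairs).
scores = {docs[0]: 0.2, docs[2]: 0.9}
result = two_stage_retrieve(query_vec, doc_vecs, docs, lambda d: scores.get(d, 0.0), k=2)
```

The second doc never reaches the reranker (stage 1 filters it out), which is the whole point: the cross-encoder only ever sees k documents, not the full corpus.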

Looking for feedback

This is early! I’d love sharp feedback on:

  • Packaging for pure MLX vs. MPS paths on Apple Silicon.
  • Model/runtime choices (e.g., Qwen3 Embedding + Reranker sizes, quantization/format) when running under LM Studio’s MLX engine.
  • Any pitfalls you’ve hit wiring MLX-backed components into RAG pipelines on macOS.

If you have time, please try it and open issues/PRs. I’m especially interested in reports across different M-series chips and LM Studio/MLX versions. Thanks!

— Joonsoo Kim


Refs: Apple’s MLX framework and LM Studio’s MLX engine; Qwen3 Embedding/Reranking series and a hands-on tutorial showing a two-stage retrieval pipeline. (GitHub, ml-explore.github.io, LM Studio, Qwen, Hugging Face, Milvus)
