🚀 [Project] Embed + Rerank API for Apple Silicon (MLX/LM Studio) — feedback welcome!
Hi everyone! I’ve been running a CUDA + Ollama homelab for a while, but rising power and hardware costs pushed me to migrate to a Mac setup. Along the way I hit the usual “context length vs. cost” trade-offs, so I built a VectorDB + n8n proxy that intercepts requests before they hit the LLM. Recently I discovered LM Studio’s MLX backend and started experimenting with MLX-accelerated pipelines on Apple Silicon — which led me here. LM Studio’s MLX engine makes on-device inference very efficient on Macs, and in my tests it’s noticeably snappier than my previous setup. (LM Studio)
While exploring models, I came across the new Qwen3 Embedding/Reranking series. These models are designed specifically for retrieval tasks (embeddings + cross-encoder reranking) and come in multiple sizes, which felt like a great fit for a compact homelab stack. (Qwen, Hugging Face, GitHub, Milvus)
What I built
embed-rerank — a small FastAPI service that exposes simple REST endpoints for:
- Embeddings (Qwen3 Embedding series)
- Reranking (Qwen3 Reranker)

It's designed to run on macOS with MLX (via LM Studio) or Metal/MPS as a fallback, so you can keep everything local on Apple Silicon. (ml-explore.github.io, LM Studio)
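To give a feel for how a client would talk to the service, here's a minimal sketch of building requests for the two endpoints. Note: the endpoint paths (`/embed`, `/rerank`), field names, and default port below are my assumptions for illustration, not the project's documented schema — check the repo README for the real API.

```python
import json

# Assumed default for a local FastAPI service; adjust to your setup.
BASE_URL = "http://localhost:8000"

def embed_request(texts):
    """Build a POST request for a hypothetical /embed endpoint."""
    return {
        "url": f"{BASE_URL}/embed",
        "json": {"texts": list(texts)},
    }

def rerank_request(query, documents, top_k=5):
    """Build a POST request for a hypothetical /rerank endpoint."""
    return {
        "url": f"{BASE_URL}/rerank",
        "json": {"query": query, "documents": list(documents), "top_k": top_k},
    }

if __name__ == "__main__":
    req = rerank_request(
        "what is MLX?",
        ["MLX is Apple's array framework.", "n8n is a workflow tool."],
    )
    print(json.dumps(req["json"], indent=2))
```

You'd pass each dict to something like `requests.post(req["url"], json=req["json"])` from your RAG pipeline or an n8n HTTP node.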
Repo: https://github.com/joonsoome/embed-rerank
Why this might be useful
- Homelab-friendly: simple REST API for RAG stacks; drop-in with your Vector DB + n8n (or other orchestrators).
- Apple Silicon first: leverages the MLX runtime path on Macs; no external GPU required. (GitHub, opensource.apple.com)
- Two-stage retrieval: dense retrieval + cross-encoder rerank with the Qwen3 series. (Qwen, Milvus)
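The two-stage pattern above can be sketched in a few lines: a cheap dense-retrieval pass over precomputed embeddings narrows the pool, then a more expensive cross-encoder re-scores only the survivors. The vectors and the word-overlap scorer below are toy stand-ins for what the Qwen3 embedding and reranker models would actually produce.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def dense_retrieve(query_vec, doc_vecs, k):
    """Stage 1: rank all docs by embedding similarity, keep the top-k indices."""
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: cosine(query_vec, doc_vecs[i]),
                   reverse=True)
    return order[:k]

def rerank(query, docs, candidates, score_fn):
    """Stage 2: re-score only the candidate indices with a cross-encoder-style score_fn."""
    return sorted(candidates, key=lambda i: score_fn(query, docs[i]), reverse=True)

if __name__ == "__main__":
    docs = ["apple silicon mlx", "cuda gpu", "mlx embeddings"]
    doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]  # toy embeddings
    candidates = dense_retrieve([1.0, 0.0], doc_vecs, k=2)
    # Toy "cross-encoder": word overlap between query and document.
    overlap = lambda q, d: len(set(q.split()) & set(d.split()))
    print(rerank("mlx embeddings", docs, candidates, overlap))
```

In the real service, stage 1 would query your vector DB and stage 2 would call the reranking endpoint on just the top candidates, which keeps the expensive cross-encoder off the full corpus.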
Looking for feedback
This is early! I’d love sharp feedback on:
- Packaging for pure MLX vs. MPS paths on Apple Silicon.
- Model/runtime choices (e.g., Qwen3 Embedding + Reranker sizes, quantization/format) when running under LM Studio’s MLX engine.
- Any pitfalls you’ve hit wiring MLX-backed components into RAG pipelines on macOS.
If you have time, please try it and open issues/PRs. I’m especially interested in reports across different M-series chips and LM Studio/MLX versions. Thanks!
— Joonsoo Kim
Refs: Apple’s MLX framework and LM Studio’s MLX engine; Qwen3 Embedding/Reranking series and a hands-on tutorial showing a two-stage retrieval pipeline. (GitHub, ml-explore.github.io, LM Studio, Qwen, Hugging Face, Milvus)