Inference for generative AI models looks like a minefield, but there’s a simple protocol for picking the right setup:
🌍 95% of users >> If you’re using open (large) models and need fast online inference, use Inference Providers in auto mode and let it choose the best provider for the model (minimal sketch after this list). https://huggingface.co/docs/inference-providers/index
👷 Fine-tuners / bespoke >> If you’ve got a custom setup, use Inference Endpoints to define a configuration on AWS, Azure, or GCP. https://endpoints.huggingface.co/
🦫 Locals >> If you’re trying to squeeze everything you can out of a server or local machine, use llama.cpp, Jan, LM Studio, or vLLM (local sketch below). https://huggingface.co/settings/local-apps#local-apps
🪟 Browsers >> If you need open models running right in the browser, use Transformers.js. https://github.com/huggingface/transformers.js
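
For the 95% path, here’s a minimal sketch of Inference Providers in auto mode via the huggingface_hub client. The model id is just an assumed example; swap in any chat model served by Inference Providers. It picks up your HF token from the environment or a previous login.

```python
# pip install huggingface_hub
from huggingface_hub import InferenceClient

# provider="auto" lets Hugging Face route the request to the best available provider
client = InferenceClient(provider="auto")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed example model id
    messages=[{"role": "user", "content": "Summarize why open models matter in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```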
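And for the local path, a rough vLLM sketch with an assumed small model; it needs a GPU, so pick whatever fits your hardware.

```python
# pip install vllm
from vllm import LLM, SamplingParams

# Load the model locally (assumed example model id)
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
params = SamplingParams(max_tokens=64, temperature=0.7)

# Offline batch-style generation; vLLM also ships an OpenAI-compatible server
outputs = llm.generate(["Explain speculative decoding in one sentence."], params)
print(outputs[0].outputs[0].text)
```
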
Let me know what you’re using, and if you think it’s more complex than this.