I've got my hands on an AMD Instinct MI100. Used, it costs about the same as a V100, but on paper it has more compute (14 TFLOPS FP32 on the V100 vs 23 TFLOPS on the MI100), and its HBM runs at a higher clock, giving 1.2 TB/s of memory bandwidth. For quantized inference it's a beast (the MI50 was also surprisingly fast).
For LoRA training in this quick test I could not make the bnb (bitsandbytes) config work, so I'm running the fine-tune on the full-size model.
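Roughly the setup I mean, as a minimal sketch using the PEFT + transformers stack: LoRA adapters on the full-precision weights, with no bitsandbytes quantization config. The model name and LoRA hyperparameters below are placeholders, not necessarily what I used.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "NousResearch/Meta-Llama-3-8B-Instruct"  # placeholder base model

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,   # full-size weights, no bnb 4/8-bit config
    device_map="auto",
)

# Illustrative LoRA config; tune rank/alpha/targets for your own run
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```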
I'll share everything I've learned about the install, setup, and settings in a blog post, together with the 3D design for the cooling shroud.
I found that if we apply the reasoning system prompt (published on the NousResearch/DeepHermes-3-Llama-3-8B-Preview model card), other models also react to it and start mimicking reasoning, some better, some worse. I've seen internal monologue and self-questioning.
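The experiment is just prompt swapping; a rough sketch of it below. The model name is only an example of a non-reasoning chat model, and the actual prompt text has to be copied from the DeepHermes-3 model card (left elided here).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # example: any non-reasoning chat model
reasoning_system_prompt = "..."  # paste the prompt from the DeepHermes-3 model card

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [
    {"role": "system", "content": reasoning_system_prompt},
    {"role": "user", "content": "How many days are between 2024-02-10 and 2024-03-05?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Watch the output for <think>-style internal monologue / self-questioning
out = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```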
Check out my idea: LLmaaS - Local LLM as a Service
With LLmaaS, I propose leveraging locally running LLMs as a service, providing a standardized way for websites to access and utilize them for LLM-powered operations directly on the user's device.
Call for contributors: join me in developing the LLmaaS proxy into a general-purpose tool for leveraging local LLMs on the web, with security measures built in. I'm looking for help to make the proxy more generic so it can support multiple local LLM services without any change on the HTML side, and for ideas on how to make the HTML part more modular and easier to use.
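To make the idea concrete, here is a minimal sketch of what such a proxy could look like, not the actual LLmaaS code: a tiny local HTTP server that a web page can call, which forwards the request to a local backend. It assumes an Ollama-style /api/chat endpoint on localhost:11434; the proxy port and the wide-open CORS policy are placeholders.

```python
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BACKEND_URL = "http://localhost:11434/api/chat"  # assumed local LLM backend (Ollama-style)

class LLmaaSProxy(BaseHTTPRequestHandler):
    def _cors(self):
        # Wide open for the sketch; a real proxy needs proper origin checks
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Access-Control-Allow-Methods", "POST, OPTIONS")
        self.send_header("Access-Control-Allow-Headers", "Content-Type")

    def do_OPTIONS(self):
        # CORS preflight so a web page can talk to the local proxy
        self.send_response(204)
        self._cors()
        self.end_headers()

    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        page_req = json.loads(body)  # e.g. {"model": "...", "messages": [...]}

        # Forward the generic request to the local backend unchanged
        backend_req = urllib.request.Request(
            BACKEND_URL,
            data=json.dumps({**page_req, "stream": False}).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(backend_req) as resp:
            answer = resp.read()

        self.send_response(200)
        self._cors()
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(answer)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8765), LLmaaSProxy).serve_forever()
```

Swapping in a different local LLM service would then only mean changing BACKEND_URL (or adding per-backend request translation) in the proxy, with no change on the HTML side.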
I've made an uncensored version of DeepSeek-R1-Distill-Llama-8B via a model merge. It passes the "say f***" censorship test. I tested it with lm-evaluation-harness on the standard Open LLM Leaderboard tasks plus HellaSwag, and scores improved on most of them. Details are on the model card.
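For reference, the eval runs look roughly like this, written against the lm-evaluation-harness 0.4.x Python API (the checkpoint path is a placeholder for the merged model, and only HellaSwag is shown here).

```python
import lm_eval

# Evaluate the merged checkpoint on HellaSwag; add the other leaderboard
# tasks to the list to reproduce the full comparison
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/merged-r1-distill-llama-8b,dtype=bfloat16",
    tasks=["hellaswag"],
    batch_size=8,
)
print(results["results"]["hellaswag"])
```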