✨ 3B with MIT license
✨ Long context window up to 128K
✨ Strong multimodal reasoning (36.8% on MathVision, on par with 10x larger models) and agent skills (34.5% on ScreenSpot-Pro)
We desperately need GPUs for model inference. CPUs can't replace them.
I will start with the basics. A GPU is designed to serve predictable workloads made up of many parallel units of work (pixels, tensors, tokens). So a GPU allocates as much of its transistor budget as possible to building thousands of compute units (CUDA cores on NVIDIA, execution units on Apple Silicon), each capable of running a thread.
But a CPU is designed to handle all kinds of workloads. CPU cores are much larger (and hence far fewer), with branch prediction and other complex machinery. On top of that, more and more transistors go into ever-larger caches (~50% of the die now) to cope with unpredictable memory access, devouring the compute budget.
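To make the gap concrete, here is a minimal sketch, assuming PyTorch is installed and a CUDA-capable GPU is present, that times the same large matrix multiply (the operation that dominates model inference) on CPU and GPU. The matrix size and repeat count are arbitrary illustrations, not a benchmark.

```python
import time
import torch

def time_matmul(device: str, n: int = 4096, repeats: int = 10) -> float:
    """Average time of an n x n matrix multiply on the given device."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    torch.matmul(a, b)  # warm-up so one-time initialization doesn't skew timing
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        torch.matmul(a, b)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued GPU kernels to finish
    return (time.perf_counter() - start) / repeats

print(f"CPU: {time_matmul('cpu'):.4f} s per matmul")
if torch.cuda.is_available():
    print(f"GPU: {time_matmul('cuda'):.4f} s per matmul")
```

The multiply maps naturally onto thousands of GPU threads, while the CPU pushes the same work through a few dozen large cores, which is why the gap only widens as models grow.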
Google published a 69-page whitepaper on Prompt Engineering and its best practices, a must-read if you are using LLMs in production:
> zero-shot, one-shot, few-shot
> system prompting
> chain-of-thought (CoT)
> ReAct
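To illustrate two of those patterns, here is a minimal sketch of a few-shot, chain-of-thought prompt in the chat-message format most LLM APIs accept. The system text, example problems, and question are made up for illustration and are not taken from the whitepaper.

```python
# System prompt sets the role; few-shot examples show the expected format;
# "Let's think step by step" nudges the model into chain-of-thought reasoning.
SYSTEM = "You are a careful math tutor. Show your reasoning, then give the final answer."

FEW_SHOT = [
    {"question": "A shirt costs $20 and is 25% off. What is the sale price?",
     "answer": "25% of 20 is 5. 20 - 5 = 15. Final answer: $15."},
    {"question": "A train travels 60 km in 45 minutes. What is its speed in km/h?",
     "answer": "45 minutes is 0.75 hours. 60 / 0.75 = 80. Final answer: 80 km/h."},
]

def build_messages(question: str) -> list[dict]:
    """Assemble a system + few-shot + CoT message list for a chat-style LLM API."""
    messages = [{"role": "system", "content": SYSTEM}]
    for ex in FEW_SHOT:
        messages.append({"role": "user", "content": ex["question"]})
        messages.append({"role": "assistant", "content": ex["answer"]})
    messages.append({"role": "user",
                     "content": f"{question}\nLet's think step by step."})
    return messages

if __name__ == "__main__":
    for m in build_messages("A pizza has 8 slices and 3 are eaten. What fraction remains?"):
        print(f"{m['role'].upper()}: {m['content']}")
```

The same message list can be passed to whichever chat-completion client you use; the point is the structure: role-setting system prompt, worked examples as prior turns, and an explicit reasoning cue on the final user turn.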