Try the multimodal version with the Colab Free T4 demo:
This repository hosts the Guanaco model with 4-bit quantized weights. The model benefits from two novel techniques introduced by GPTQ: quantizing columns in order of decreasing activation size, and performing sequential quantization within a single Transformer block. Together, these innovations let compact, consumer-level multilingual models function effectively.
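As a rough illustration of the first technique, the sketch below orders a layer's input columns by decreasing activation energy and then applies plain round-to-nearest 4-bit quantization. It is a simplified sketch under stated assumptions, not the GPTQ implementation: the function names and the calibration data are hypothetical, and the error-compensation step that GPTQ performs after each column is omitted.

```python
def activation_order(calib_inputs):
    """Return input-column indices sorted by decreasing activation energy.

    calib_inputs: list of calibration samples, each a list of per-feature
    activations. The energy of column j is the sum of squares of feature j
    over all samples (assumed here as the measure of "activation size").
    """
    n_features = len(calib_inputs[0])
    energy = [sum(sample[j] ** 2 for sample in calib_inputs)
              for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: energy[j], reverse=True)


def quantize_4bit(values):
    """Round-to-nearest onto 16 evenly spaced levels spanning [min, max].

    Returns (codes, scale, offset) so that value ~= code * scale + offset.
    """
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 when all values are equal
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


# Hypothetical calibration activations: 3 samples, 3 input features.
calib = [[0.1, 2.0, -0.5], [0.2, -1.5, 0.4], [-0.1, 1.0, 0.6]]
order = activation_order(calib)
print("quantization order:", order)

# Hypothetical weight column, quantized to integer codes in 0..15.
codes, scale, offset = quantize_4bit([-1.0, -0.2, 0.3, 1.0])
print("4-bit codes:", codes)
```

Columns with the largest activations are handled first, on the reasoning that errors there affect the layer output the most.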
The Guanaco model aims to be a minimal multilingual conversational model for use as a human-computer interface. It is meant to handle simple Q&A interactions with a comprehensive understanding of grammar, a rich vocabulary, and stability similar to that of large-scale language models.
However, consumer hardware is too limited to run models at the performance level of ChatGPT (GPT-3.5) or GPT-4 on its own. Our model, with a reduced number of parameters, can still run on older hardware generations, requiring less than 6 GB of memory after 4-bit quantization. The only constraint is speed, which depends on the actual hardware configuration.
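The memory figure can be sanity-checked with simple arithmetic. The sketch below estimates the space taken by the weights alone at several bit widths. The parameter count of 7 billion is an assumption for illustration (it is not stated in this card), and the estimate ignores quantization scales, activations, and any cache, all of which add to the real footprint.

```python
def weight_memory_gib(n_params, bits_per_weight):
    """Memory needed to store the weights alone, in GiB."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 1024 ** 3


# Assumption: a 7-billion-parameter model (not stated in this card).
N_PARAMS = 7_000_000_000

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gib(N_PARAMS, bits):.2f} GiB")
```

Under that assumption, 4-bit weights take roughly a quarter of the space of 16-bit weights, which leaves headroom below the 6 GB figure for runtime overhead.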
Instead of competing with large models like ChatGPT, we pursue a different approach: a functionally complete language model without any inherent knowledge or computational ability. We achieve this by integrating APIs for knowledge acquisition (e.g., querying online resources like Wikipedia or utilizing Wolfram|Alpha for calculations) to provide accurate information to users, rather than relying on the model's learning and understanding capabilities. The primary goal is to create a stable large-scale language model for human-computer interaction.
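The idea of delegating knowledge and calculation to external services can be sketched as a small dispatcher. Everything here is hypothetical: the `name: argument` call format, the tool names, and the local stand-ins that replace real services such as Wikipedia or Wolfram|Alpha. It only shows the shape of the approach, in which the model decides which tool to call and the tool supplies the facts.

```python
import ast
import operator

# Arithmetic operators the stand-in calculator accepts.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def calculate(expression):
    """Evaluate +, -, *, / arithmetic without eval(); stands in for a calculation API."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression.strip(), mode="eval").body)


# Stand-in for an online knowledge source.
FACTS = {"guanaco": "A South American camelid."}


def lookup(topic):
    """Return a stored fact, or a fallback message when the topic is unknown."""
    return FACTS.get(topic.strip().lower(), "no entry found")


TOOLS = {"calc": calculate, "lookup": lookup}


def answer(tool_call):
    """Dispatch a 'name: argument' string (the format assumed in this sketch)."""
    name, _, argument = tool_call.partition(":")
    tool = TOOLS.get(name.strip())
    if tool is None:
        raise KeyError(f"unknown tool: {name!r}")
    return tool(argument.strip())


print(answer("calc: 6 * 7"))
print(answer("lookup: Guanaco"))
```

In a real system the two stand-ins would be replaced by network calls, and the language model would be responsible for producing the tool call and phrasing the final reply.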
An example of this approach is processing long articles or PDF documents. With the single-threaded operation of the conventional ChatGPT (GPT-3.5) API, text must be divided into segments and matched against user input, which is inefficient. Our minimal multilingual model can instead analyze text sentence by sentence, generating multiple human-readable questions for each sentence. It can then establish logical connections between these questions using a Question-Answer tree structure and algorithms like PageRank, and answer users based on this preliminary logical analysis.
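The ranking step can be illustrated with a small PageRank over linked questions. This is a minimal sketch under assumptions: the questions and the links between them are invented for illustration, and the way edges are derived (a question points to the questions it depends on) is a guess at how such a structure might be built, not the authors' method.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over {node: [linked nodes]}.

    Every linked node must also appear as a key. Rank held by nodes
    without outgoing links is spread evenly over all nodes.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[node] for node in nodes if not graph[node])
        new_rank = {}
        for node in nodes:
            incoming = sum(rank[src] / len(graph[src])
                           for src in nodes if node in graph[src])
            new_rank[node] = (1 - damping) / n + damping * (incoming + dangling / n)
        rank = new_rank
    return rank


# Hypothetical questions generated from a document, each linked to
# the questions it depends on.
qa_links = {
    "What is GPTQ?": ["What is quantization?"],
    "What is quantization?": [],
    "Why use 4-bit weights?": ["What is quantization?", "What is GPTQ?"],
}

ranks = pagerank(qa_links)
for question, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:.3f}  {question}")
```

Questions that many others depend on rise to the top, which gives a simple way to pick the sentences most relevant to a user's query.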
Furthermore, our model can be applied to summarizing web search results. These use cases are challenging for large models because of cost, scale, and rate limits, and are more feasible on local, small-scale, consumer-level hardware. This direction is the next step in our efforts.