Instructions for using JosephusCheung/GuanacoOnConsumerHardware with libraries and local apps.

Libraries

How to use JosephusCheung/GuanacoOnConsumerHardware with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="JosephusCheung/GuanacoOnConsumerHardware")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("JosephusCheung/GuanacoOnConsumerHardware")
model = AutoModelForCausalLM.from_pretrained("JosephusCheung/GuanacoOnConsumerHardware")

Local Apps

vLLM

How to use JosephusCheung/GuanacoOnConsumerHardware with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "JosephusCheung/GuanacoOnConsumerHardware"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "JosephusCheung/GuanacoOnConsumerHardware",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/JosephusCheung/GuanacoOnConsumerHardware

SGLang

How to use JosephusCheung/GuanacoOnConsumerHardware with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "JosephusCheung/GuanacoOnConsumerHardware" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "JosephusCheung/GuanacoOnConsumerHardware",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "JosephusCheung/GuanacoOnConsumerHardware" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "JosephusCheung/GuanacoOnConsumerHardware",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use JosephusCheung/GuanacoOnConsumerHardware with Docker Model Runner:
```
docker model run hf.co/JosephusCheung/GuanacoOnConsumerHardware
```

Try Multimodal version with Colab Free T4 demo:

This repository contains the Guanaco model with 4-bit quantized weights. The model benefits from two techniques introduced by GPTQ: quantizing columns in order of decreasing activation size, and performing sequential quantization within a single Transformer block. These techniques allow compact, consumer-level multilingual models to function effectively.

The Guanaco model aims to be a minimal multilingual conversational model for use as a human-computer interface. It is meant to handle simple Q&A interactions with a comprehensive understanding of grammar, a rich vocabulary, and stability similar to that of large-scale language models.

However, because of the limitations of consumer hardware, models at the performance level of ChatGPT 3.5 or GPT-4 cannot run on it independently. Our model, with a reduced number of parameters, can still run on older hardware generations and requires less than 6GB of memory after 4-bit quantization. The only constraint is speed, which depends on the actual hardware configuration.
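To make the memory claim concrete, here is a rough back-of-the-envelope sketch of how weight precision affects memory use. The parameter count below is an assumption chosen for illustration (this card does not state one), and the estimate covers weights only, ignoring activations, the KV cache, and quantization metadata such as scales and zero points.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / (1024 ** 3)


# ASSUMPTION: a hypothetical 7-billion-parameter model, for illustration only.
ASSUMED_PARAMS = 7e9

fp16_gib = weight_memory_gib(ASSUMED_PARAMS, 16)
int4_gib = weight_memory_gib(ASSUMED_PARAMS, 4)

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights:  {int4_gib:.1f} GiB")
```

Under that assumption, 4-bit weights take a quarter of the space of 16-bit weights, which is what leaves headroom within a 6GB budget for runtime overhead.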

Instead of competing with large models like ChatGPT, we pursue a different approach: a functionally complete language model without any inherent knowledge or computational ability. We achieve this by integrating APIs for knowledge acquisition (e.g., querying online resources like Wikipedia or utilizing Wolfram|Alpha for calculations) to provide accurate information to users, rather than relying on the model's learning and understanding capabilities. The primary goal is to create a stable large-scale language model for human-computer interaction.
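The idea of delegating knowledge and calculation to external tools can be sketched as a small router. This is only an illustration, not the authors' implementation: `FAKE_KNOWLEDGE_BASE`, `calculate`, `lookup`, `gather_context`, and `build_prompt` are hypothetical names, the dictionary stands in for an online source such as Wikipedia, and the local arithmetic evaluator stands in for a service such as Wolfram|Alpha.

```python
import ast
import operator

# Stand-in for an online knowledge source (hypothetical data).
FAKE_KNOWLEDGE_BASE = {
    "guanaco": "The guanaco is a camelid native to South America.",
}

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculate(expression: str) -> float:
    """Stand-in for a calculation service: supports + - * / on numbers only."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)


def lookup(topic: str) -> str:
    """Stand-in for a knowledge query; returns a stored fact or a fallback."""
    return FAKE_KNOWLEDGE_BASE.get(topic.lower().strip(), "No entry found.")


def gather_context(query: str) -> str:
    """Try the calculator first; fall back to the knowledge lookup."""
    try:
        return f"{query} = {calculate(query)}"
    except (ValueError, SyntaxError, ZeroDivisionError):
        return lookup(query)


def build_prompt(query: str) -> str:
    """Give the retrieved fact to the language model instead of relying on its memory."""
    context = gather_context(query)
    return f"Context: {context}\nQuestion: {query}\nAnswer using only the context."


print(build_prompt("2 * (3 + 4)"))
print(build_prompt("Guanaco"))
```

In this arrangement the model only has to phrase an answer from the supplied context, which matches the goal of a language model that carries no inherent knowledge or computational ability of its own.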

An example of this approach is processing long articles or PDF documents. With the single-threaded operation of the ChatGPT 3.5 API, text must be divided into segments and matched with user input, which is inefficient. Our minimal multilingual model can instead analyze text sentence by sentence, generating multiple human-readable questions for each sentence. It can then establish logical connections between these questions using a Question-Answer tree structure and algorithms like PageRank, and provide users with answers based on this preliminary logical analysis.
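As a minimal sketch of the ranking step, the snippet below runs a plain PageRank power iteration over a tiny graph of generated questions. The card does not describe how the authors build or link the questions, so the graph, the edge meaning (question A depends on question B), and the function name `pagerank` are assumptions made for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over {node: [linked nodes]}; all targets must be keys."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node with no outgoing links spreads its rank evenly.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical questions generated from a document; an edge A -> B means
# "answering A depends on B".
question_links = {
    "What is GPTQ?": ["What is quantization?"],
    "How much memory is needed?": ["What is quantization?", "What is GPTQ?"],
    "What is quantization?": [],
}

scores = pagerank(question_links)
for question, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{score:.3f}  {question}")
```

Questions that many others depend on receive the highest scores, so they are natural candidates to answer first or to use as roots when assembling a Question-Answer tree.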

Furthermore, our model can be applied to summarizing web search results. These use cases, which are challenging for large models because of cost, scale, and rate limits, are more feasible on local, small-scale, consumer-level hardware. This direction is the next step in our efforts.
