llama.cpp

#4
by rakmik - opened

!git clone https://github.com/ggerganov/llama.cpp
%cd llama.cpp

!git clone --branch BambaArchitecture https://github.com/gabe-l-hart/llama.cpp.git

%cd /content/llama.cpp

!cmake -B build -DGGML_CUDA=ON
!cmake --build build --config Release

!/content/llama.cpp/build/bin/llama-cli -m /content/bamba-9b.gguf -p "Building a website can be done in 10 steps:" -ngl 32

!/content/llama.cpp/build/bin/llama-cli -m /content/bamba-9b.gguf -p "Building a website can be done in 10 steps:"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla T4, compute capability 7.5, VMM: yes
build: 4741 (9626d935) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (Tesla T4) - 14992 MiB free
llama_model_loader: loaded meta data with 31 key-value pairs and 407 tensors from /content/bamba-9b.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = bamba
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Bamba 9B
llama_model_loader: - kv 3: general.basename str = Bamba
llama_model_loader: - kv 4: general.size_label str = 9B
llama_model_loader: - kv 5: general.license str = apache-2.0
llama_model_loader: - kv 6: bamba.embedding_length u32 = 4096
llama_model_loader: - kv 7: bamba.block_count u32 = 32
llama_model_loader: - kv 8: bamba.context_length u32 = 0
llama_model_loader: - kv 9: bamba.vocab_size u32 = 128256
llama_model_loader: - kv 10: bamba.feed_forward_length u32 = 14336
llama_model_loader: - kv 11: bamba.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 12: bamba.ssm.state_size u32 = 128
llama_model_loader: - kv 13: bamba.ssm.group_count u32 = 1
llama_model_loader: - kv 14: bamba.ssm.inner_size u32 = 8192
llama_model_loader: - kv 15: bamba.ssm.head_dim u32 = 64
llama_model_loader: - kv 16: bamba.ssm.time_step_rank u32 = 128
llama_model_loader: - kv 17: bamba.attention.layer_indices arr[i32,3] = [9, 18, 27]
llama_model_loader: - kv 18: bamba.rope.dimension_count u32 = 64
llama_model_loader: - kv 19: bamba.attention.head_count u32 = 32
llama_model_loader: - kv 20: bamba.attention.head_count_kv u32 = 8
llama_model_loader: - kv 21: bamba.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - type f32: 239 tensors
llama_model_loader: - type f16: 168 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = all F32 (guessed)
print_info: file size = 18.22 GiB (16.00 BPW)
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'bamba'
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/content/bamba-9b.gguf'
main: error: unable to load model

What is the solution?

!./bin/llama-cli -ngl 0 -m /content/bamba-9b.gguf -p "Tell me a story about a developer and their dog"

build: 4358 (9177484f) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 31 key-value pairs and 407 tensors from /content/bamba-9b.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = bamba
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Bamba 9B
llama_model_loader: - kv 3: general.basename str = Bamba
llama_model_loader: - kv 4: general.size_label str = 9B
llama_model_loader: - kv 5: general.license str = apache-2.0
llama_model_loader: - kv 6: bamba.embedding_length u32 = 4096
llama_model_loader: - kv 7: bamba.block_count u32 = 32
llama_model_loader: - kv 8: bamba.context_length u32 = 0
llama_model_loader: - kv 9: bamba.vocab_size u32 = 128256
llama_model_loader: - kv 10: bamba.feed_forward_length u32 = 14336
llama_model_loader: - kv 11: bamba.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 12: bamba.ssm.state_size u32 = 128
llama_model_loader: - kv 13: bamba.ssm.group_count u32 = 1
llama_model_loader: - kv 14: bamba.ssm.inner_size u32 = 8192
llama_model_loader: - kv 15: bamba.ssm.head_dim u32 = 64
llama_model_loader: - kv 16: bamba.ssm.time_step_rank u32 = 128
llama_model_loader: - kv 17: bamba.attention.layer_indices arr[i32,3] = [9, 18, 27]
llama_model_loader: - kv 18: bamba.rope.dimension_count u32 = 64
llama_model_loader: - kv 19: bamba.attention.head_count u32 = 32
llama_model_loader: - kv 20: bamba.attention.head_count_kv u32 = 8
llama_model_loader: - kv 21: bamba.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - type f32: 239 tensors
llama_model_loader: - type f16: 168 tensors
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'bamba'
llama_load_model_from_file: failed to load model
common_init_from_params: failed to load model '/content/bamba-9b.gguf'
main: error: unable to load model

!pip install "gpt4all[cuda]"

!wget https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-IQ3_M.gguf

from gpt4all import GPT4All

model = GPT4All("/content/Llama-3.2-1B-Instruct-IQ3_M.gguf", device="cuda", ngl=-1) # device='amd', device='intel'
output = model.generate("The capital of France is ", max_tokens=111)
print(output)

!wget https://huggingface.co/MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF/resolve/main/Mistral-7B-Instruct-v0.3.Q4_K_M.gguf

from gpt4all import GPT4All

model = GPT4All("/content/Mistral-7B-Instruct-v0.3.Q4_K_M.gguf", device="cuda", ngl=-1) # device='amd', device='intel'
output = model.generate("who is python?", max_tokens=111)
print(output)

!wget https://huggingface.co/ibm-ai-platform/Bamba-9B/resolve/main/bamba-9b.gguf

from gpt4all import GPT4All

model = GPT4All("/content/bamba-9b.gguf", device="cuda", ngl=-1) # device='amd', device='intel'
output = model.generate("who is python?", max_tokens=111)
print(output)

https://github.com/nomic-ai/gpt4all/issues/3503

https://github.com/state-spaces/mamba/releases/tag/v2.2.4

https://www.nomic.ai/blog/posts/gpt4all-gpu-inference-with-vulkan

!wget https://github.com/state-spaces/mamba/releases/download/v2.2.4/mamba_ssm-2.2.4+cu12torch2.6cxx11abiTRUE-cp311-cp311-linux_x86_64.whl

!pip install /content/mamba_ssm-2.2.4+cu12torch2.6cxx11abiTRUE-cp311-cp311-linux_x86_64.whl

from mamba_ssm import Mamba

import torch
from mamba_ssm import Mamba

batch, length, dim = 2, 64, 16
x = torch.randn(batch, length, dim).to("cuda")
model = Mamba(
    # This module uses roughly 3 * expand * d_model^2 parameters
    d_model=dim,  # Model dimension d_model
    d_state=16,   # SSM state expansion factor
    d_conv=4,     # Local convolution width
    expand=2,     # Block expansion factor
).to("cuda")
y = model(x)
assert y.shape == x.shape

from mamba_ssm import Mamba2
model = Mamba2(
    # This module uses roughly 3 * expand * d_model^2 parameters
    d_model=dim,  # Model dimension d_model
    d_state=64,   # SSM state expansion factor, typically 64 or 128
    d_conv=4,     # Local convolution width
    expand=2,     # Block expansion factor
).to("cuda")
y = model(x)
assert y.shape == x.shape

!pip install causal-conv1d

It works.

from mamba_ssm import Mamba2

from mamba_ssm import Mamba

!pip uninstall mamba_ssm -y # Uninstall the existing mamba_ssm package
!pip install mamba_ssm --no-cache-dir # Reinstall without using the cache


from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("ibm-ai-platform/Bamba-9B-fp8")
tokenizer = AutoTokenizer.from_pretrained("ibm-ai-platform/Bamba-9B-fp8")

message = ["Mamba is a snake with following properties "]
inputs = tokenizer(message, return_tensors='pt', return_token_type_ids=False)
response = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])

https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#cuda

https://huggingface.co/ibm-ai-platform/Bamba-9B/blob/main/README.md

https://github.com/gabe-l-hart/llama.cpp/tree/BambaArchitecture

!git clone https://github.com/ggerganov/llama.cpp
%cd llama.cpp

Error:

!git clone --branch BambaArchitecture [email protected]:gabe-l-hart/llama.cpp.git

Correct:

!git clone --branch BambaArchitecture https://github.com/gabe-l-hart/llama.cpp.git

!git pull origin BambaArchitecture

%cd /content/llama.cpp

import torch
print(torch.version.cuda)

!rm Makefile

!mkdir build
%cd build
!cmake ..

%cd /content/llama.cpp

!cmake -B build -DGGML_CUDA=ON
!cmake --build build --config Release

https://github.com/gabe-l-hart/llama.cpp/blob/BambaArchitecture/docs/build.md

!./build/bin/llama-cli -m PATH_TO_MODEL -p "Building a website can be done in 10 steps:" -ngl 32

!./build/bin/llama-cli -m /content/bamba-9b.gguf -p "Building a website can be done in 10 steps:" -ngl 32

https://www.philschmid.de/sagemaker-llama-llm

https://huggingface.co/docs/transformers/main/en/model_doc/bamba#overview

https://huggingface.co/docs/transformers/main/en/modular_transformers#real-world-example-breakdown

https://huggingface.co/docs/transformers/main/en/model_doc/bamba#transformers.BambaModel

https://huggingface.co/ibm-ai-platform

https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2

https://github.com/abetlen/llama-cpp-python/issues/1352

https://github.com/abetlen/llama-cpp-python/issues/1933

https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/tag/wheels

https://huggingface.co/docs/transformers/main/en/model_doc/bamba#transformers.BambaConfig

https://huggingface.co/docs/transformers/main/en/model_doc/llama2#overview

from transformers import AutoTokenizer, BambaForCausalLM

model = BambaForCausalLM.from_pretrained("...")
tokenizer = AutoTokenizer.from_pretrained("...")

prompt = "Hey, are you conscious? Can you talk to me?"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate

generate_ids = model.generate(inputs.input_ids, max_length=30)
tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

!./build/bin/llama-cli -h

!./build/bin/llama-cli -m /content/bamba-9b.gguf -p "Building a website can be done in 10 steps:" -ngl -1

!/content/llama.cpp/build/bin/llama-cli -m /content/Mistral-7B-Instruct-v0.3.Q4_K_M.gguf -p "Building a website can be done in 10 steps:" -ngl -1

!/content/llama.cpp/build/bin/llama-cli -h

!/content/llama.cpp/build/bin/llama-cli -m /content/bamba-9b.gguf -p "Building a website can be done in 10 steps:" -ngl -1

!/content/llama.cpp/build/bin/llama-cli -m /content/Mistral-7B-Instruct-v0.3.Q4_K_M.gguf -p "hi"

!/content/llama.cpp/build/bin/llama-cli -m /content/Llama-3.2-1B-Instruct-IQ3_M.gguf

./build/bin/llama-cli -m PATH_TO_MODEL -p "Building a website can be done in 10 steps:" -ngl 32

!/content/llama.cpp/build/bin/llama-cli -m /content/Llama-3.2-1B-Instruct-IQ3_M.gguf -p "Building a website can be done in 10 steps:" -ngl 32

!/content/llama.cpp/build/bin/llama-cli -m /content/Mistral-7B-Instruct-v0.3.Q4_K_M.gguf -p "Building a website can be done in 10 steps:" -ngl 32

/content/Mistral-7B-Instruct-v0.3.Q4_K_M.gguf

!/content/llama.cpp/build/bin/llama-cli -m /content/bamba-9b.gguf -p "Building a website can be done in 10 steps:" -ngl 32

!/content/llama.cpp/build/bin/llama-cli -m /content/bamba-9b.gguf -p "Building a website can be done in 10 steps:"

!git clone --branch BambaArchitecture https://github.com/gabe-l-hart/llama.cpp.git

!git checkout BambaArchitecture

!git fetch origin BambaArchitecture
!git checkout BambaArchitecture

!git pull

%cd llama.cpp

!mkdir build
%cd build
!cmake .. -DGGML_CUDA=ON
!cmake --build . --config Release

%cd /content/llama.cpp

!git checkout BambaArchitecture
!git pull

!./build/bin/llama-cli -m /content/bamba-9b.gguf -p "Building a website can be done in 10 steps:" -ngl 32

#!rm -rf llama.cpp
%cd /content/a
!git clone --branch BambaArchitecture https://github.com/gabe-l-hart/llama.cpp.git

%cd llama.cpp
!mkdir build
%cd build

NOTE: To build with debug symbols and extra logging, use CMAKE_BUILD_TYPE=Debug

!cmake .. -DCMAKE_BUILD_TYPE=Release
!make -j
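Per the note above, a debug build only swaps the build type. A minimal sketch of that variant, assuming the same build directory layout:

# Debug build with extra logging (sketch; run from the build directory)
!cmake .. -DCMAKE_BUILD_TYPE=Debug
!make -j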

!mkdir build
%cd build
!cmake .. -DGGML_CUDA=ON
!cmake --build . --config Release

%cd /content/a/llama.cpp
!cmake -B build -DGGML_CUDA=ON
!cmake --build build --config Release

%cd /content/llama.cpp

!git clone --branch BambaArchitecture https://github.com/gabe-l-hart/llama.cpp.git

!git branch

git clone --branch BambaArchitecture https://github.com/gabe-l-hart/llama.cpp.git
cd llama.cpp
git branch

You should see: * BambaArchitecture

mkdir build
cd build
cmake .. -DGGML_CUDA=ON
cmake --build build --config Release

Running llama.cpp with the Bamba model:

!git fetch origin BambaArchitecture
!git checkout BambaArchitecture

!git pull

!mkdir build

%cd build
!cmake .. -DGGML_CUDA=ON
!cmake --build . --config Release

https://github.com/gabe-l-hart/llama.cpp/tree/BambaArchitecture

!/content/llama.cpp/build/bin/llama-cli -m /content/bamba-9b.gguf

!git clone --branch BambaArchitecture https://github.com/gabe-l-hart/llama.cpp.git
%cd llama.cpp

%cd /content

!git clone --branch BambaArchitecture https://github.com/gabe-l-hart/llama.cpp.git
%cd llama.cpp

!git branch

https://github.com/gabe-l-hart/llama.cpp/releases/tag/b4358

%cd /content

!wget https://github.com/gabe-l-hart/llama.cpp/releases/download/b4358/llama-b4358-bin-ubuntu-x64.zip

!unzip /content/llama-b4358-bin-ubuntu-x64.zip

!/content/build/bin/llama-cli -m /content/bamba-9b.gguf

Run the model with no layers on the GPU (CPU-only)

%cd /content/build
!./bin/llama-cli -ngl 0 -m /content/bamba-9b.gguf -p "Tell me a story about a developer and their dog"

!/content/build/bin/test-gguf seed 7777 -m /content/bamba-9b.gguf

!/content/build/bin/test-autorelease -m /content/bamba-9b.gguf

!/content/build/bin/llama-run -m /content/bamba-9b.gguf

!/content/build/bin/llama-export-lora -m /content/bamba-9b.gguf

!./convert_hf_to_gguf.py /content/bamba-9b.gguf --outfile /path/to/bamba-model/bamba-model.gguf

!./bin/llama-cli -ngl 0 -m /content/bamba-9b.gguf -p "Tell me a story about a developer and their dog"

ibm-ai-platform/Bamba-9B-fp8

!./convert_hf_to_gguf.py ibm-ai-platform/Bamba-9B-fp8 --outfile /content/bamba-model.gguf

from huggingface_hub import snapshot_download

# Download all of the model's files to the target directory

snapshot_download(repo_id="ibm-ai-platform/Bamba-9B-fp8", local_dir="/content/models/llama", local_dir_use_symlinks=False)
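Once the files are local, the llama.cpp conversion script can be pointed at the downloaded directory rather than at a repo ID or an existing GGUF. A rough sketch, assuming the BambaArchitecture checkout is at /content/llama.cpp and the output path is arbitrary:

# Convert the downloaded HF checkpoint directory to GGUF (sketch; paths are assumptions)
!python /content/llama.cpp/convert_hf_to_gguf.py /content/models/llama --outfile /content/bamba-9b-converted.gguf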

Hi @rakmik, thanks for digging into GGUF / llama.cpp with Bamba! From the error you're seeing, it looks like the tools you're using weren't built from my fork quite right. I do see that in part of your notebook you're downloading from the Releases section of my fork. I think this is a quirk of GitHub forks: those releases just got inherited from the upstream repo, so they are likely not built against a point in my fork's history that has Bamba support.

I also see that you're getting the errors after attempting to build locally. Those are still more mysterious to me, so I'll take a look at this when I get some cycles.

I'm able to verify on my Mac that with -ngl 0 (CPU-only), the Bamba model loads and runs with llama-cli using my branch. Things to try on your end:

  1. Try running with -ngl 0 to run CPU-only
  2. Try building without CUDA support and running with -ngl 0

Currently, the only verified path on my branch is CPU-only. We'll be working towards GPU enablement (CUDA and Metal at a minimum) going forward.
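In a Colab cell, that CPU-only path would look roughly like this (a sketch; the checkout location and model path are assumptions):

# CPU-only build (no -DGGML_CUDA=ON), then run with -ngl 0 so no layers go to the GPU
%cd /content/llama.cpp
!cmake -B build
!cmake --build build --config Release
!./build/bin/llama-cli -ngl 0 -m /content/bamba-9b.gguf -p "Tell me a story about a developer and their dog"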

Thank you gabegoodhart, it runs very well!

Glad to hear it!
