About
Mozilla's llamafile is an open-source project designed to simplify the deployment and execution of large language models (LLMs) by packaging them into single-file executables. It combines llama.cpp — a high-performance C++ implementation for running LLMs — with Cosmopolitan Libc, enabling cross-platform compatibility without requiring installation or complex dependencies.
This llamafile was created for granite-4.0-micro, a flexible 3B-parameter model well suited to command-line usage.
Installation Guide (Linux, BSD, Mac)
Download the file:
wget https://huggingface.co/Otilde/granite-4.0-micro-llamafile/resolve/main/granite-4.0-micro-Q8_0.llamafile
Alternatively, you can download the Q4 or Q6 quantization for a smaller file.
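If you use curl rather than wget, the equivalent download is below. The second command sketches fetching a smaller quantization; the Q6 filename is an assumption based on common GGUF naming, so check the repo's file list for the exact name.
curl -L -O https://huggingface.co/Otilde/granite-4.0-micro-llamafile/resolve/main/granite-4.0-micro-Q8_0.llamafile
curl -L -O https://huggingface.co/Otilde/granite-4.0-micro-llamafile/resolve/main/granite-4.0-micro-Q6_K.llamafile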
Make it executable:
chmod +x granite-4.0-micro-Q8_0.llamafile
Run it!
./granite-4.0-micro-Q8_0.llamafile
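Running it with no arguments starts an interactive session (whether that's a terminal chat or the local web UI varies by llamafile version). For a quick non-interactive sanity check, you can pass a one-shot prompt; the flags here are standard llamafile/llama.cpp options:
./granite-4.0-micro-Q8_0.llamafile -p "Say hello in one sentence." -n 32 --temp 0.4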
Note for Windows:
After downloading the llamafile, you will need to rename the file so it ends in '.exe' before Windows will run it. This can be done directly through File Explorer or in the Windows Terminal:
Rename-Item -Path "granite-4.0-micro-Q8_0.llamafile" -NewName "granite-4.0-micro-Q8_0.llamafile.exe"
.\granite-4.0-micro-Q8_0.llamafile.exe
Since this file is under 4GB there should be no problem running it natively; Windows cannot execute files larger than 4GB, so bigger llamafiles may require WSL (https://mozilla-ai.github.io/llamafile/troubleshooting/).
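One related tip: because every llamafile is also a valid shell script (a property of the Cosmopolitan APE format), you can fall back to running it through sh on Linux or WSL if direct execution fails, for example due to binfmt_misc interference:
sh ./granite-4.0-micro-Q8_0.llamafile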
Examples
You can run ./granite-4.0-micro-Q8_0.llamafile --help to see all available options. Here are a few example prompts that may be useful.
ls -lh | ./granite-4.0-micro-Q8_0.llamafile --no-display-prompt --chat -p "In this large directory, please tell me the name of the earliest file I created." --nologo --interactive-first --color -e --temp 0.4 --chat-template vicuna
man vim | ./granite-4.0-micro-Q8_0.llamafile --no-display-prompt --chat -p "What are some interesting Vim command line options found in the man pages?" --nologo --interactive-first --color -e --temp 0.6 --chat-template vicuna -n -2 -c 2048
--no-display-prompt and --nologo reduce the amount of screen space taken up by boilerplate, presenting a cleaner chat interface.
--temp 0.4: for this model I found 0.4 works best at reducing potential hallucinations on tasks that need to be as straightforward as possible. For more creativity, I go with 0.8.
--chat-template vicuna: interestingly, I've found that this chat template makes Granite more coherent when given lots of context, though it's not necessary.
--n-predict N (also -n N): sets the number of tokens to predict when generating text. A higher value produces longer output, a lower value shorter output; -2, as in the example above, means generate until the context is full.
-c N: sets the context size in tokens; increase it to fit longer inputs like the man page above, or decrease it to save memory. A combined example follows this list.
If the model does not stop producing tokens, stop it manually with Ctrl+C (Ctrl+Z merely suspends the process rather than killing it).
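Putting several of these flags together, here is one more sketch in the same style as the examples above; the piped file and prompt are just illustrative:
cat README.md | ./granite-4.0-micro-Q8_0.llamafile --no-display-prompt --nologo --color -e --temp 0.4 -n 256 -c 4096 -p "Summarize this README in three bullet points."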
That's it! The benefit of llamafiles is that they're incredibly quick and easy to run, but offer lots of power inside. For future llamafiles, I'll further customize the embedded .args file to include more of these optimizations, so there's less for users to type out each time.
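For reference, that customization works by packing a .args file into the llamafile's zip archive with the zipalign tool that ships with the llamafile project. A minimal sketch, assuming zipalign is on your PATH and reusing flags from the examples above (the trailing '...' line lets users still pass their own extra arguments, per the llamafile docs' convention):
# One argument per line; '...' passes through any additional CLI args
cat << 'EOF' > .args
--temp
0.4
--nologo
--no-display-prompt
...
EOF
zipalign -j0 granite-4.0-micro-Q8_0.llamafile .args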
Base model: ibm-granite/granite-4.0-micro