An Edge-First Generalized LLM LoRA Fine-Tuning Framework for Heterogeneous GPUs
KEY HIGHLIGHTS
Tether Data's AI research division introduces QVAC-fabric-llm, a unified framework that integrates a complete Low-Rank Adaptation (LoRA) fine-tuning workflow directly into the llama.cpp ecosystem. This marks a significant step in the QVAC project's mission: it is the first solution for parameter-efficient fine-tuning that works seamlessly across the entire consumer hardware ecosystem.
Democratizing True Cross-Platform Fine-Tuning: Building on the QVAC mission to democratize AI, this work delivers genuine hardware-agnostic compatibility. It breaks the dependency on specific vendors, empowering the global AI community to perform fine-tuning on any modern device, from mobile to desktop and server.
Unlocking On-Device Personalization with QVAC-fabric-llm: We demonstrate the first successful fine-tuning runs on mobile GPUs, a previously unsupported capability unlocked by our solution. This breakthrough enables true on-device personalization and instruction-tuning, a critical step for aligning models on consumer hardware.
Enabling Modern LLM Architectures: The QVAC-fabric-llm framework brings fine-tuning support for state-of-the-art models (Qwen3 and Gemma3) to llama.cpp, providing the first cross-platform support for these modern architectures and expanding the scope of models that can be adapted by the community.
Empowering the Community with Open Resources: To accelerate development and innovation, Tether Data is publicly releasing Multi-Platform Binaries and a collection of Fine-tuned Model Adapters. We are also providing the source code, which is currently a Work-in-Progress (WIP) and intended for experimental use, to enable developers to extend the solution for other LLM models.
All code contributions are upstream-safe: we did not modify any existing llama.cpp public APIs and instead introduced new additional APIs and modules. This ensures full compatibility with the upstream codebase and positions our work for seamless upstream integration in the near future, enabling the broader community to benefit from native llama.cpp support. The code is made available under the Apache 2.0 license and is intended to allow researchers and developers to immediately begin building and testing custom models across all supported hardware.
Copyright Complaints: We will take appropriate actions in response to notice of copyright infringement. If you believe your work has been used or copied in a manner that infringes upon your intellectual property rights, please email [email protected] identifying and describing both the copyrighted work and alleged infringing content to file a notice of infringement.
Overview
The ability to fine-tune Large Language Models (LLMs) on user-specific data is crucial for personalization and broader adoption. To preserve privacy, ensure operational continuity in high-latency regions (e.g., emerging markets), and provide an anti-fragile, highly resilient, and scalable AI platform, it is desirable that this fine-tuning occurs locally on consumer devices. However, existing on-device fine-tuning solutions are limited: they lack GPU acceleration or are restricted to specific vendor ecosystems, failing to support the diverse range of consumer-grade and mobile hardware.
Tether Data, S.A. de C.V. (Tether Data, we, us, our) presents a portable, LoRA-based fine-tuning solution that operates across the full spectrum of consumer GPU architectures. By integrating fine-tuning directly into a cross-platform inference engine and leveraging a portable graphics API, our solution enables efficient training on diverse hardware, from smartphones (with Mali, Adreno, and Apple GPUs) to desktops, laptops and servers (with AMD, Intel, NVIDIA, and Apple GPUs). We also implemented the necessary architectural extensions and operator designs to support LoRA LLM fine-tuning for modern transformer models like Qwen3 and Gemma3.
We validate our approach with two real-world applications: email style transfer and biomedical question answering. The results demonstrate successful on-device fine-tuning across all tested platforms. By transforming LLM fine-tuning from a vendor-specific capability into a cross-platform solution, our work represents a significant step toward democratizing LLM personalization and making this technology accessible to a wide audience of end users.
1. Introduction
The growing capabilities of Large Language Models (LLMs) have intensified the need for efficient fine-tuning methods to adapt these general-purpose models to specific tasks and domains. While full fine-tuning (which updates all of a model's parameters) offers maximum flexibility, its prohibitive computational and memory costs render it impractical for most users.
1.1 Parameter-Efficient Fine-Tuning
This challenge has accelerated the development of Parameter-Efficient Fine-Tuning (PEFT) methods, which achieve strong performance by updating only a small fraction of a model's weights. Among PEFT techniques, Low-Rank Adaptation (LoRA) has emerged as a dominant standard. By freezing a model's pre-trained weights and injecting trainable, low-rank matrices into its transformer layers, LoRA reduces the number of trainable parameters by orders of magnitude, making fine-tuning feasible on consumer-grade hardware. However, a significant accessibility gap remains: LoRA's practical implementation is restricted to a limited set of GPU architectures, preventing broader use of the parallel processing power available in modern mobile GPUs. Most open-source fine-tuning frameworks, such as PEFT, Unsloth, and LLaMA-Factory, are built on PyTorch or JAX and primarily target NVIDIA's CUDA platform. While highly effective, this focus perpetuates hardware dependency.
1.2 LLAMA.CPP
The llama.cpp project has become the de-facto library for efficient, cross-platform LLM inference, supporting a wide range of hardware from Windows, macOS, and Linux to mobile devices. Nevertheless, the existing implementation is constrained by several shortcomings:
- It relies exclusively on full fine-tuning, a method whose computational and memory demands are prohibitive for consumer-grade hardware.
- The system operates only on raw text tokens and does not support structured data formats, which prevents the instruction-tuning process necessary for teaching models to follow user commands and develop assistant-like capabilities.
- Its fine-tuning support is limited to the CPU, failing to harness the parallel processing power of mobile GPUs. This architectural choice leads to excessively slow training because the most capable processing units in modern consumer hardware are left unused.
1.3 Our Contributions
Our work addresses the logical next step: democratizing LoRA fine-tuning across this diverse hardware ecosystem. We present the integration of LoRA directly into the llama.cpp framework. To achieve true cross-platform compatibility, we leverage the Vulkan graphics and compute API, which provides a unified programming interface across virtually all modern GPU vendors, including desktop (AMD, Intel, NVIDIA) and mobile (Qualcomm Adreno, ARM Mali, Apple) platforms. Our work makes the following key contributions:
- A Unified Framework for Consumer Hardware: We integrate a complete Low-Rank Adaptation (LoRA) fine-tuning workflow into llama.cpp, providing APIs to initialize, train, checkpoint, resume training, save, and merge adapters across multiple quantization formats and precisions. This creates the first solution for parameter-efficient fine-tuning that works across heterogeneous hardware, from desktop to mobile. The included ability to merge adapters and run inference on device makes it possible to run custom fine-tuned models everywhere, including on Android GPUs.
- Enabling Modern LLM Architectures: We implement optimized GPU kernels and backward passes to bring fine-tuning for state-of-the-art models (Qwen3 and Gemma3) to llama.cpp, providing the first cross-platform support for these architectures on consumer hardware.
- On-Device Instruction-Tuning: We introduce a masked-loss training objective to llama.cpp, enabling effective instruction-tuning. This allows models to learn directly from target responses, a critical capability for aligning models to follow instructions on consumer devices.
Through use cases in email style transfer and biomedical Q&A, we demonstrate our solution's core capabilities:
- On-Device Fine-Tuning on Mobile GPUs: We demonstrate the first successful fine-tuning on mobile GPUs (Adreno, Mali, Apple), a previously unsupported capability that unlocks true on-device personalization.
- Seamless Cross-Platform Scalability: Our solution provides universal compatibility across the entire desktop GPU ecosystem, including AMD, Intel, NVIDIA, and Apple architectures.
To enable the community to build upon this work, we are publicly releasing the following resources:
- Multi-Platform Binaries: Pre-built fine-tuning binaries for all supported platforms are available on our GitHub repository. The repo includes datasets and quick-start commands to begin fine-tuning on all supported hardware immediately.
🚀 Access QVAC-fabric-llm Finetuning Binaries
Access the first truly cross-platform LoRA fine-tuning solution for Large Language Models
🔗 Get access now
- Fine-tuned Model Adapters: Model adapters fine-tuned on device using our fine-tuning solution. These are available on our Hugging Face page and allow for on-device merging and inference to test the efficacy of the fine-tuning.
🚀 Download QVAC-fabric-llm Finetuned Model Adapters
Access the adapters and test LoRA adapter merging and inference on device
🔗 Get the Adapters
- Source code for new LLM extensions: We are also providing the source code to enable developers to extend the solution to other LLM models. All contributions were explicitly engineered to be upstream-safe. Because we did not modify any existing llama.cpp public APIs and only introduced new APIs and modules, we plan to upstream these contributions in the near future, allowing the community to benefit from native llama.cpp support.
This work democratizes LLM fine-tuning by breaking its dependency on specific hardware vendors, offering a cross-platform solution for practitioners, developers, and end-users alike.
2. Cross-Platform Training for Heterogeneous GPUs
2.1 LoRA
Architecture Overview
Figure 1: The Low-Rank Adaptation (LoRA) module we integrated into llama.cpp augments the original pretrained weight matrix W with a low-rank update, scaled by alpha/r. This is expressed as:
W' = W + (alpha/r) * AB
- Updated Weight Matrix (W'): Used during fine-tuning and inference.
- Pretrained Weight Matrix (W): The original, fixed weight matrix.
- Low-Rank Matrices (A and B): These are the only components updated during training, where A is a d x r matrix and B is an r x d matrix.
- Rank (r): The inner dimension of the low-rank matrices, where r << d. A lower rank leads to smaller update matrices and fewer trainable parameters.
- Scaling Factor (alpha): A hyperparameter used to scale the low-rank update AB.
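To make the arithmetic concrete, the sketch below (plain C++, illustrative only and independent of the actual ggml kernels) applies a LoRA-augmented projection to a single input vector, following the W' = W + (alpha/r) * AB formulation above; the dimensions d and r are placeholders.

#include <cstddef>
#include <vector>

// Illustrative LoRA forward for one square projection: y = W x + (alpha/r) * A (B x).
// W is d x d (frozen), A is d x r, B is r x d (trainable), all stored row-major.
std::vector<float> lora_forward(const std::vector<float> & W,
                                const std::vector<float> & A,
                                const std::vector<float> & B,
                                const std::vector<float> & x,
                                size_t d, size_t r, float alpha) {
    const float scale = alpha / (float) r;

    // Bx = B * x  (r-dimensional intermediate, the cheap part of the low-rank path)
    std::vector<float> Bx(r, 0.0f);
    for (size_t i = 0; i < r; ++i)
        for (size_t j = 0; j < d; ++j)
            Bx[i] += B[i * d + j] * x[j];

    // y = W x + scale * A * Bx
    std::vector<float> y(d, 0.0f);
    for (size_t i = 0; i < d; ++i) {
        for (size_t j = 0; j < d; ++j)
            y[i] += W[i * d + j] * x[j];          // frozen base projection
        for (size_t k = 0; k < r; ++k)
            y[i] += scale * A[i * r + k] * Bx[k]; // low-rank correction
    }
    return y;
}

For example, with d = 2048 and r = 16, the trainable parameters per projection drop from d * d ≈ 4.2M to 2 * d * r = 65,536, roughly a 64x reduction.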
Key Features
The LoRA adapters are applied to all linear layers within the transformer blocks. This includes the query, key, value, and output projections in the self-attention mechanism, as well as the linear layers in the feed-forward network (FFN). This comprehensive application was chosen to maximize the adaptive capacity of the fine-tuning process. To manage this process, we introduced a set of functions into the llama.cpp public API (a usage sketch follows the list):
- llama_lora_training_init(): Initializes and allocates memory for a LoRA adapter, creating the tensors for matrices A and B.
- llama_opt_init(): Configures the training optimizer. We modified it to accept a parameter filter that freezes the base model weights and restricts gradient updates to the LoRA tensors.
- llama_opt_epoch(): Executes a single training epoch, performing both forward and backward passes to update the LoRA adapter weights.
- llama_lora_save_adapter(): Serializes the trained LoRA weights (A and B) and saves them to a separate GGUF file for runtime use.
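The sketch below outlines how these calls compose into a training loop. The function names come from the list above, but the argument lists are simplified placeholders rather than the real signatures, so treat this as a workflow outline, not compilable client code.

// Workflow outline only: argument lists are placeholders, not the real signatures.

// 1. Load the base model and create a context as usual in llama.cpp.
// 2. Allocate the LoRA adapter (tensors A and B) for the chosen rank and alpha.
llama_lora_training_init(/* ctx, model, lora_rank, lora_alpha, target modules */);

// 3. Configure the optimizer with a parameter filter that freezes base weights
//    and marks only the LoRA tensors as trainable.
llama_opt_init(/* ctx, model, optimizer params incl. LoRA-only filter */);

// 4. Train: each call runs forward + backward over the dataset and updates A/B.
for (int epoch = 0; epoch < n_epochs; ++epoch) {
    llama_opt_epoch(/* ctx, dataset, train/eval results, callbacks */);
    // Optionally write a checkpoint (model + optimizer state) here to resume later.
}

// 5. Serialize the trained adapter (A and B) to a standalone GGUF file.
llama_lora_save_adapter(/* ctx, "adapter.gguf" (placeholder path) */);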
Figure 2: How inputs flow through the LoRA adapters, whose output is added to that of the frozen base model weights W before being forwarded as the layer output.
Implementation
Our choice of the Vulkan backend was strategic for achieving true cross-platform support. Unlike proprietary APIs like CUDA or Metal, Vulkan is a modern, low-level, vendor-agnostic standard that provides direct control over GPU hardware across a vast ecosystem. This includes all major desktop (NVIDIA, AMD, Intel) and mobile (Qualcomm Adreno, ARM Mali) vendors, making it the ideal choice to democratize fine-tuning. To leverage this backend for robust training, several foundational capabilities had to be added to llama.cpp:
Expanded Data Type Support: We extended backward pass support to compute gradients not only in float32 (the standard for stability and precision), but also in float16 (for reduced memory usage and faster GPU execution), int8, and int4 (to enable quantization-aware training and efficient fine-tuning under tight memory constraints). This makes llama.cpp flexible enough to support both high-precision training scenarios and lightweight fine-tuning on resource-limited devices.
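To illustrate why reduced-precision gradients matter on memory-constrained devices, the short program below tabulates the gradient-buffer footprint of a hypothetical rank-16 adapter; the layer count, number of adapted projections, and hidden size are illustrative assumptions, not the dimensions of any specific model.

#include <cstdio>

int main() {
    // Hypothetical adapter shape: 28 transformer blocks, 7 adapted projections per
    // block (q, k, v, o + 3 FFN), hidden size 2048, LoRA rank 16.
    const long n_layers = 28, n_proj = 7, d = 2048, r = 16;
    const long trainable = n_layers * n_proj * 2 * d * r;   // A (d x r) + B (r x d)

    const double mib = 1024.0 * 1024.0;
    printf("trainable params : %ld\n", trainable);
    printf("grads fp32       : %.1f MiB\n", trainable * 4 / mib);
    printf("grads fp16       : %.1f MiB\n", trainable * 2 / mib);
    printf("grads int8       : %.1f MiB\n", trainable * 1 / mib);
    return 0;
}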
New Operator Implementations: The OUT_PROD operator, required for the LoRA backward pass, was implemented for the CUDA and Vulkan backends (int8 and fp16), eliminating the need for graph splits during fine-tuning. With this change, LoRA training can be executed fully on the GPU, not only with CUDA but also with Vulkan, enabling efficient cross-platform fine-tuning. This is essential for supporting state-of-the-art non-LLaMA models, such as Qwen and Gemma, where OUT_PROD sits on the critical path of LoRA adaptation (see the sketch after the operator list below). Operator coverage added:
- OPT_STEP_ADAM: (Adam/AdamW step)
- SILU_BACK: (SiLU/Swish backward)
- RMS_NORM_BACK: (RMSNorm backward)
- SOFT_MAX_BACK: (softmax backward)
- ROPE_BACK: (RoPE backward)
- OUT_PROD: (outer product) with quantized and FP paths
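To see why OUT_PROD sits on the critical path, note that with y = W x + (alpha/r) * A (B x), the gradient of the loss with respect to A is an outer product of the incoming gradient g = dL/dy with the intermediate activation h = B x (and the gradient with respect to B is likewise an outer product). The sketch below (plain C++, not the Vulkan shader) computes that per-token outer product.

#include <cstddef>
#include <vector>

// dL/dA for one token of the LoRA path y += scale * A * (B x):
// grad_A[i][k] = scale * g[i] * h[k], with g = dL/dy (d-dim) and h = B x (r-dim).
// This is exactly an outer product, which is what the OUT_PROD operator provides.
std::vector<float> lora_grad_A(const std::vector<float> & g,  // length d
                               const std::vector<float> & h,  // length r
                               float scale) {
    const size_t d = g.size(), r = h.size();
    std::vector<float> grad_A(d * r);
    for (size_t i = 0; i < d; ++i)
        for (size_t k = 0; k < r; ++k)
            grad_A[i * r + k] = scale * g[i] * h[k];
    return grad_A;
}

In a batched setting these per-token outer products are accumulated across the sequence and batch, which is why an efficient GPU OUT_PROD, including quantized input paths, removes the need to fall back to the CPU or split the compute graph.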
Support for Modern Architectures (GEGLU): To extend fine-tuning capabilities beyond the LLaMA family, we implemented the backward pass for the GEGLU (GELU Gated Linear Unit) activation function. GEGLU is a key component in the feed-forward network of modern architectures like Google's Gemma. While llama.cpp already supported the GEGLU forward pass for inference, performing backpropagation to calculate gradients is impossible without its corresponding backward pass. Our implementation of this operator was therefore a prerequisite for enabling fine-tuning on Gemma and other state-of-the-art models. We also implemented the Cross-Entropy backward pass, which is essential for training models on classification-style tasks.
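As a reference for what the new backward kernel has to compute, the scalar sketch below implements GEGLU and its gradients using the exact (erf-based) GELU. The actual ggml kernels operate on whole tensors and may use the tanh approximation of GELU, so this is only the underlying math, not the implementation.

#include <cmath>

// Exact GELU and its derivative: GELU(x) = x * Phi(x), GELU'(x) = Phi(x) + x * phi(x),
// where Phi/phi are the standard normal CDF/PDF.
static float gelu(float x)      { return 0.5f * x * (1.0f + erff(x * 0.70710678f)); }
static float gelu_grad(float x) {
    const float Phi = 0.5f * (1.0f + erff(x * 0.70710678f)); // standard normal CDF
    const float phi = 0.39894228f * expf(-0.5f * x * x);     // standard normal PDF
    return Phi + x * phi;
}

// GEGLU forward: the FFN input is split into a gate half g and an up half u.
static float geglu(float g, float u) { return gelu(g) * u; }

// GEGLU backward: given dL/dout, the gradients w.r.t. the two halves.
static void geglu_grad(float g, float u, float dout, float * dg, float * du) {
    *dg = dout * u * gelu_grad(g);  // chain rule through GELU(g)
    *du = dout * gelu(g);
}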
Apple Support: To bring LoRA fine-tuning to Apple GPUs (M-series and A-series), we implemented the missing backward and optimizer kernels in the ggml Metal backend as native MSL compute shaders. The design goal was parity with CUDA/Vulkan operator coverage, including quantized data paths. Because the kernels target Metal directly rather than device-specific code paths, the implementation runs across the entire Apple GPU family, macOS (M-series) and iOS/iPadOS (A-series), without code changes. The same compute shaders, memory pipelines, and quantized kernels execute on both platforms, enabling mobile LoRA training on iPhones and iPads as well as on desktops and laptops.
2.2 Low-Level Operators
New GGML Operations
| Operation | Description |
|---|---|
| GGML_OP_CROSS_ENTROPY_LOSS_MASKED | Computes cross-entropy loss only on tokens where the mask is active (assistant responses). |
| GGML_OP_CROSS_ENTROPY_LOSS_MASKED_BACK | Performs the backward pass for the masked loss, propagating gradients only where the mask is active. |
| GGML_OP_COUNT_EQUAL_MASKED | Counts correct predictions only over tokens where the mask is active, so that accuracy metrics reflect assistant tokens only. |
All operations are compatible with GPU backends, including Vulkan, and have dedicated shader implementations for efficient masked reductions and elementwise operations.
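For clarity, the masked accuracy operation reduces to the following per-batch computation (a plain C++ sketch of the semantics, not the GGML kernel): count the positions where the arg-max prediction equals the label, restricted to positions where the mask is active.

#include <cstddef>
#include <cstdint>
#include <vector>

// Semantics of the masked count-equal operation (sketch): correct predictions are
// only counted at positions whose mask is non-zero (assistant tokens).
int64_t count_equal_masked(const std::vector<int32_t> & pred,    // argmax per position
                           const std::vector<int32_t> & label,   // target token ids
                           const std::vector<float>   & mask) {  // 1 = assistant, 0 = other
    int64_t correct = 0;
    for (size_t i = 0; i < pred.size(); ++i)
        if (mask[i] != 0.0f && pred[i] == label[i])
            ++correct;
    return correct;
}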
New Vulkan Shaders
These shaders are optimized for coalesced memory access and utilize fused multiply-add (FMA) instructions for high throughput.
| Shader | Description |
|---|---|
| cross_entropy_loss_masked_back.comp | Computes masked cross-entropy gradients efficiently on GPU, handling both masked and unmasked tokens. |
| count_equal_masked.comp | Counts correct predictions in unmasked positions to compute accuracy metrics directly on GPU. |
New GGML API Functions
GGML_API struct ggml_tensor * ggml_cross_entropy_loss_masked(..);
GGML_API struct ggml_tensor * ggml_count_equal_masked(..);
GGML_API struct ggml_tensor * ggml_opt_dataset_masks(..);
These new APIs expose masked operations and dataset functionality to downstream modules such as finetune-lora.cpp.
2.3 Instruction Fine-Tuning
As part of the effort to support instruction fine-tuning capabilities in llama.cpp, we implemented masked-loss training, where a mask is applied to train only on assistant tokens. This enables the model to focus exclusively on assistant responses while ignoring system and user prompts - a critical component for instruction-following model alignment.
Key Features
- Masked Loss: Train only on assistant tokens, ignoring system/user prompts.
- Count Equal Op: Computes accuracy only over assistant tokens (where the mask is active), yielding more meaningful metrics.
- Chat Template: Supports built-in ChatML format and custom Jinja-based templates for flexible dataset preprocessing.
- GPU Acceleration: Optimized Vulkan shaders for masked loss and accuracy operations, ensuring efficient fine-tuning on GPU backends.
- Checkpointing (Model + Optimizer): Supports saving and loading checkpoints that include both model weights and optimizer state, enabling resumable training and reproducible fine-tuning workflows.
- Flexible Rank & Scaling: Configurable rank (r) and scaling factor (α) per adapter for controlling parameter efficiency and adaptation strength.
- Training Hyperparameters:
- Learning Rate Scheduler: Supports multiple scheduling strategies including cosine, constant, and linear decay.
- Warmup Steps: Allows gradual learning rate increase during early training for stability.
- Weight Decay: Provides configurable weight decay regularization for improved generalization.
- Optimizer Configuration: Fully integrated with checkpointing to preserve optimizer state (i.e., momentum buffers, adaptive learning rate values).
- Merged Adapter Export: Ability to export the model with merged LoRA weights into a standalone .gguf model for inference.
- Mixed Precision Training: Optional FP16 / FP32 adapter training to reduce memory usage.
- Cross-Backend Compatibility: Unified LoRA interface working across CPU, Vulkan, Metal, and OpenCL backends.
Implementation
Masked Loss Computation
The centerpiece of the instruction fine-tuning implementation is masked loss, which ensures that only assistant responses contribute to the training objective. This is essential for instruction fine-tuning, where the goal is to optimize the model’s behavior as the assistant, not as the user or system. During dataset processing, each token in the sequence is annotated with a mask value indicating whether it should contribute to the loss function. Tokens corresponding to assistant messages are marked with 1, while all others (system or user tokens) are set to 0.
This design ensures that:
- The model learns only from assistant responses.
- User and system messages influence the context but not the loss.
- The same tokenization and masking logic are used consistently during both dataset creation and loss computation.
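A minimal scalar sketch of this objective (illustrative C++, not the GGML/Vulkan implementation) makes the semantics explicit: the per-token cross-entropy is computed from the logits with a numerically stable log-sum-exp, skipped wherever the mask is zero, and averaged over the number of assistant tokens only.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Masked cross-entropy over a sequence: logits is [n_tokens x n_vocab] row-major,
// labels holds the target token id per position, mask is 1 for assistant tokens.
float cross_entropy_loss_masked(const std::vector<float> & logits,
                                const std::vector<int>   & labels,
                                const std::vector<float> & mask,
                                size_t n_tokens, size_t n_vocab) {
    double loss = 0.0;
    double n_active = 0.0;
    for (size_t t = 0; t < n_tokens; ++t) {
        if (mask[t] == 0.0f) continue;             // system/user tokens: no loss contribution
        const float * row = &logits[t * n_vocab];

        // numerically stable log-sum-exp
        float max_logit = row[0];
        for (size_t v = 1; v < n_vocab; ++v) max_logit = std::max(max_logit, row[v]);
        double sum_exp = 0.0;
        for (size_t v = 0; v < n_vocab; ++v) sum_exp += std::exp((double)(row[v] - max_logit));

        // -log p(label) = log-sum-exp - logit(label)
        loss += std::log(sum_exp) + max_logit - row[labels[t]];
        n_active += 1.0;
    }
    return n_active > 0.0 ? (float)(loss / n_active) : 0.0f;
}

The corresponding backward operation applies the same mask to the usual softmax-minus-one-hot gradient, so gradients at system and user positions are exactly zero.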
Chat Template System
Instruction fine-tuning depends on consistent conversational formatting. The implementation includes a chat template system supporting both the ChatML format and custom Jinja templates, ensuring compatibility with Hugging Face datasets and tokenizer pipelines. The ChatML format provides a structured markup separating roles (system, user, assistant) and ensures interoperability with ChatML-compatible datasets.
- Integrates with the existing common_chat_templates system.
- Supports Hugging Face–compatible {{role}} and {{content}} placeholders.
- Provides a fallback to ChatML if the custom template fails to render.
- Enables alignment with different model families (e.g., OpenAI, Anthropic, Mistral, Qwen).
This flexible system allows training across diverse datasets without modifying the core pipeline.
Data Processing
The dataset pipeline handles preprocessing from JSONL inputs to fully tokenized, masked and padded tensors. It ensures consistent formatting and masking across various dataset sources.
Input Format Example
{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"},
{"role": "assistant", "content": "The capital of France is Paris."}
]
}
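For reference, with the built-in ChatML template the conversation above renders roughly as follows (exact whitespace and special-token handling depend on the tokenizer):

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
The capital of France is Paris.<|im_end|>

Only the assistant response and its closing tag are marked with mask = 1; everything above it shapes the context but contributes nothing to the loss.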
Usage Examples
Basic Instruction Fine-tuning
# Using built-in ChatML template
./build/bin/llama-finetune-lora \
-m model.gguf \
-f conversations.jsonl \
--assistant-loss-only \
--lora-rank 16 \
--lora-alpha 32 \
-c 512 -b 128 -ub 128 \
-ngl 999 -fa off
Custom Chat Template
# Using custom Jinja template
./build/bin/llama-finetune-lora \
-m model.gguf \
-f conversations.jsonl \
--assistant-loss-only \
--chat-template custom_template.jinja \
--lora-modules "attn_q,attn_k,attn_v,attn_o" \
-c 128 -b 128 -ub 128 \
-ngl 999 -fa off
These commands demonstrate how to enable masked-loss instruction fine-tuning with LoRA adapters, using either the built-in ChatML template or a custom dataset template.
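Once training completes, the resulting adapter can be merged and tested on device. The commands below assume the standard llama.cpp tools (llama-export-lora, llama-cli) are built alongside the fine-tuning binary and use placeholder file names; adjust them to your build and adapter paths.
# Merge the trained adapter into a standalone model, then run inference with it
./build/bin/llama-export-lora \
    -m model.gguf \
    --lora lora-adapter.gguf \
    -o model-merged.gguf

./build/bin/llama-cli -m model-merged.gguf -p "Write a short, casual email about meeting at the gym."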
2.4 Enabling Fine-Tuning for Resource-Constrained Devices
A primary challenge was enabling LoRA fine-tuning for models like Qwen3 on Qualcomm Adreno GPUs (e.g., Adreno 830) using Vulkan. The investigation focused on crashes in the Vulkan backend during MUL_MAT and OUT_PROD operations involving the very large tensors used in LoRA fine-tuning. The team initially simplified the shaders to create a minimal baseline, then developed minimal test cases in the llama.cpp suite to reproduce the crashes.
After confirming the issue was linked to complex indexing arithmetic on large buffers within a minimal GLSL shader and noting that the OpenCL backend passed the same tests, the problem was isolated to the Vulkan driver. The root cause was finally identified as an undocumented limit in the Adreno 830 Vulkan driver concerning the cumulative size of input and output buffers (SSBOs) for a single operator.
Solution: Dynamic Tiling
We implemented a dynamic tiling algorithm for MUL_MAT and OUT_PROD. Instead of performing one large matrix multiplication, the operation is broken down into smaller, independent tiles that respect the 128MiB memory limit.
Figure 3: Dynamic tiling solution for very large matrices.
The algorithm is as follows:
- Input: A matrix multiplication operation with inputs A and B.
- Calculate Tile Size: Based on the input shapes (M, N, K) and their data types, dynamically calculate the largest possible tile dimensions (M_t, N_t) such that the combined size of the input and output sub-tensors for a single operation remains below the 128MiB hardware limit.
- Execute Tiled Computation: Iterate through the larger matrices, executing the MUL_MAT or OUT_PROD operator on one small tile at a time.
- Assemble Output: Copy the result of each tiled operation into the correct offset of the final output tensor.
This approach allows llama.cpp to execute arbitrarily large matrix operations on the Adreno GPU without triggering the hardware limitation, with the tile sizes adapting dynamically to different models and data types.
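The sketch below illustrates the tiling logic in plain C++ for a row-major float matmul C[M x N] = A[M x K] * B[K x N]. The real implementation operates on ggml tensors and dispatches Vulkan shaders per tile, and the tile-selection heuristic shown here (halving the larger dimension until the budget is met) is a simplification of the shape- and dtype-aware calculation described above.

#include <algorithm>
#include <cstddef>
#include <vector>

static const size_t MAX_BYTES = 128ull * 1024 * 1024;  // observed per-dispatch buffer budget

// Pick the largest tile (Mt x Nt) such that the A-tile, B-tile, and C-tile together
// stay under MAX_BYTES. Simple heuristic: halve the larger dimension until it fits.
static void pick_tile(size_t M, size_t N, size_t K, size_t elem, size_t & Mt, size_t & Nt) {
    Mt = M; Nt = N;
    while ((Mt * K + K * Nt + Mt * Nt) * elem > MAX_BYTES && (Mt > 1 || Nt > 1)) {
        if (Mt >= Nt) Mt = (Mt + 1) / 2; else Nt = (Nt + 1) / 2;
    }
}

// Tiled matmul: each tile is small enough to be dispatched as one GPU operation.
void mul_mat_tiled(const float * A, const float * B, float * C,
                   size_t M, size_t N, size_t K) {
    size_t Mt, Nt;
    pick_tile(M, N, K, sizeof(float), Mt, Nt);

    for (size_t m0 = 0; m0 < M; m0 += Mt) {
        for (size_t n0 = 0; n0 < N; n0 += Nt) {
            const size_t m1 = std::min(m0 + Mt, M);
            const size_t n1 = std::min(n0 + Nt, N);
            // One "dispatch": compute the C tile [m0,m1) x [n0,n1) and write it
            // directly at the correct offset of the full output tensor.
            for (size_t i = m0; i < m1; ++i)
                for (size_t j = n0; j < n1; ++j) {
                    float acc = 0.0f;
                    for (size_t k = 0; k < K; ++k)
                        acc += A[i * K + k] * B[k * N + j];
                    C[i * N + j] = acc;
                }
        }
    }
}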
3 Results
The primary result of this work is the successful enablement of LoRA fine-tuning for modern LLMs (Qwen3, Gemma) on cross-platform GPUs. The fine-tuned models are made available under the same license terms as the original models to facilitate review of our work and software and to support research: Gemma models under the Gemma Terms of Use and Qwen models under the Apache 2.0 license.
We validated our work across a range of hardware and models on multiple datasets to evaluate unstructured fine-tuning and instruction fine-tuning:
- Tested Models: Gemma-1b, Qwen3-0.6B, TinyLlama-1.1B
- Tested GPU Platforms:
- Qualcomm Adreno 830 (Vulkan)
- ARM Mali-G715 (Vulkan)
- NVIDIA, AMD, Intel, and Apple GPUs
3.1 Datasets Summary
We evaluate cross-backend LoRA fine-tuning with two complementary corpora chosen to stress different aspects of the training stack while remaining lightweight enough for mobile/edge GPUs.
- Unstructured, conversational text (synthetic “personal email”) to analyse unstructured/raw text fine-tuning for style transfer and format structure.
- Structured, biomedical yes/no questions with canonical labels to analyse instruction fine-tuning on medical questions with classification-style losses, balanced sampling, and strict reproducibility.
Both datasets are synthetically generated to limit the likelihood of including PII. The synthetic email dataset is made available under CC-BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0), which is license-compatible for research, and preprocessed into compact JSONL. The structured biomedical yes/no questions were specific entries in the PubMedQA dataset, as described herein. The PubMedQA dataset was made available under the MIT license at the time it was accessed. All splits are stratified where applicable (seed=42) and exported with a MANIFEST.json for counts and provenance.
Unstructured, conversational text — Synthetic “Personal Email”
This is a 100% machine-generated US-English corpus created in-session from constrained prompts; no scraping, no mailbox ingestion. Intended for style-transfer/formatting robustness rather than factual knowledge.
Content & Format: 200 emails with fields {id, subject, body}. Subjects include varied surface forms (e.g., Re:/Fwd:, lowercase, single-emoji); bodies are compact (<300 words), casual voice, occasional quoted reply lines (> …), and generic venues (“the park,” “the gym”) to avoid identifiers.
Safety & Licensing: Synthetically generated to limit the likelihood of including real names, addresses, companies, or unique venues. Intended for research only. Made available under CC-BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0).
Structured — Biomedical Yes/No (PubMedQA-derived)
Sources & Access: PubMedQA labeled (~1k; pqa_labeled) and artificial (~80k; pqa_artificial) subsets accessed programmatically via Hugging Face (qiaojin/PubMedQA). MIT-licensed per the dataset card. Fields used: question, final_decision (Yes/No/Maybe), long_answer.
3.2 Training
Style Transfer: Email Data
To validate core causal language modelling capabilities, we fine-tuned Qwen3-1.7B on an email corpus using standard next-token prediction. This experiment validates:
- Unmasked autoregressive training works as expected
- Standard causal language modelling without instruction masking
- Style transfer capability where the model learns to generate text in a specific writing style
- LoRA produces expected convergence behavior and style adaptation
Figure 4: Training loss on Qwen3-1.7B - email data.
Instruction Fine-Tuning: Biomedical Data
To validate that LoRA training inside llama.cpp behaves equivalently to established framework workflows (e.g., PyTorch + HuggingFace), we performed a small-scale instruction-tuning experiment on a biomedical Q&A dataset. The goal of this evaluation was not to maximize accuracy, but to demonstrate that:
- Instruction-tuning and masked-loss logic operate correctly
- LoRA weight updates apply consistently across devices
- Training convergence behavior matches PyTorch
- The model meaningfully adapts to a specialized domain, even with limited data
Figure 5: Training loss on Qwen3-1.7B - biomedical data.
Fine-tuning Speed (Time per Epoch, Qwen3-1.7B Q8)
In Table 1 below, we show the time per epoch and the total training time over eight epochs for various hardware configurations.
| Hardware | Time/Epoch | Full Training (8 epochs) |
|---|---|---|
| RTX 4090 | 5.5 min | 45 min |
| AMD 7900 XTX | 13 min | 1.7 hrs |
| Intel Arc A770 | 20 min | 2.7 hrs |
| Apple M3 Pro | 40 min | 5.3 hrs |
| Adreno 830 | 1h 40min | 13 hrs |
| Mali G715 | 7h 40min | 61 hrs |
Table 1: Running times for fine-tuning on different architectures.
📊 View complete benchmarks with detailed metrics across all platforms
Quality Comparison vs PyTorch
Table 2 below shows the model quality evaluation in terms of Win Rate (which answer is judged superior by a capable LLM judge?), accuracy (which model has better biomedical knowledge?), and cosine similarity (compared to a reference LLM output, how similar are the outputs of the PyTorch and QVAC fine-tuned LLMs?). The conclusion: QVAC-fabric-llm achieves near-parity quality with established frameworks while running on roughly 8x more hardware platforms.
| Metric | QVAC-fabric-llm | PyTorch/HuggingFace |
|---|---|---|
| LLM-as-Judge Win Rate | 45-48% | 52-55% |
| Biomedical Accuracy | 79-94% | 78-86% |
| Cosine Similarity | 0.82 | 0.77 |
Table 2: Quality comparison metrics for PyTorch and QVAC models.
Key Takeaways
The model showed consistent domain-adaptation behavior across all GPUs tested, from mobile (Mali, Adreno) to desktop (Intel, AMD, Apple) to datacenter-class NVIDIA GPUs. llama.cpp’s LoRA pipeline produces the same functional model adaptation patterns seen in PyTorch, even at small scale, validating correctness of:
- LoRA weight injection & update flow
- Masked-loss instruction-training path
- Cross-entropy backward kernels
- Vulkan + Metal gradient paths
- Q4/Q8 quantized training behavior
Importantly, this biomedical task highlights the broader utility of portable fine-tuning: the ability to adapt models for high-stakes, knowledge-intensive environments such as healthcare, scientific research, and regulated enterprise applications, even on devices that traditionally have not been considered “training-capable”. By enabling consistent LoRA training across NVIDIA, AMD, Intel, Apple Silicon, and mobile GPUs, we make domain adaptation accessible in contexts where data privacy and locality are crucial. This means sensitive datasets never need to leave the user’s device or institution, supporting compliance-driven deployment models.
4. Future Work
Future work will focus on advancing the framework's efficiency and model support through several key avenues. We plan to expand quantization support by integrating formats such as GPTQ-INT8 and Q5_K_M, which offer superior trade-offs between computational speed and model fidelity. Kernel optimizations will continue, with efforts to enhance cache locality in the OUT_PROD shader and tailor workgroup parameters for core operations on mobile GPUs. We will also pursue lower-overhead memory management by eliminating staging buffers and adopting bindless descriptors to minimize CPU contention. Finally, we will investigate advanced compiler-level optimizations, such as operator fusion on Adreno architectures, to further increase training throughput and hardware utilization.
5. Conclusion
We present a unified, cross-platform framework that successfully enables parameter-efficient training of modern LLMs with LoRA on consumer hardware such as mobile SoCs and desktop GPUs, without relying on a CUDA-only ecosystem. We leverage Vulkan for cross-vendor acceleration (Mali, Adreno, Intel, AMD, NVIDIA) and Metal for Apple platforms, enabling fine-tuning across heterogeneous devices with a consistent user-facing API and training interface. Our contributions include the development of critical GPU kernels and backward passes to support SOTA architectures like Qwen3 and Gemma3, and the introduction of a masked-loss objective for effective on-device instruction-tuning. Furthermore, we overcome the fundamental barrier of mobile fine-tuning through a novel dynamic tiling method that manages severe memory constraints. Collectively, these innovations break long-standing hardware limitations, as demonstrated by the first successful fine-tuning on mobile GPUs and universal compatibility across desktop architectures. The results validated through our use cases confirm that high-quality, local, and private fine-tuning is no longer confined to powerful data centers; it is now a viable and accessible capability for the broad ecosystem of consumer-grade hardware, thereby paving the way for a new generation of personalized, highly resilient, anti-fragile, and privacy-preserving on-device AI applications.
References
"llama.cpp." GitHub, ggml-org, https://github.com/ggml-org/llama.cpp.
Jin, Qiao, et al. "PubMedQA: A Dataset for Biomedical Research Question Answering." Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.
Vulkan Graphics and Compute API. Khronos Group, https://www.vulkan.org/.




