LoRA Adapter for Citation Generation

Welcome to Granite Experiments!

Think of Experiments as a preview of what's to come. These projects are still under development, but we wanted to let the open-source community take them for a spin! Use them, break them, and help us build what's next for Granite – we'll keep an eye out for feedback and questions. Happy exploring!

Just a heads-up: Experiments are forever evolving, so we can't commit to ongoing support or guarantee performance.

Model Summary

This is a RAG-specific LoRA adapter for ibm-granite/granite-3.2-8b-instruct that is fine-tuned for the citation generation task. Given a multi-turn conversation between a user and an AI assistant ending with an assistant response and a set of documents/passages on which the last assistant response is supposed to be based, the adapter generates citations for the last assistant response from the provided documents/passages. The LoRA adapter has the following features:

  1. Fine-grained citations: The adapter generates citations for each sentence in the assistant response (when available). Moreover, each citation consists of a set of sentences from the documents/passages that support the corresponding sentence in the assistant response.
  2. Post-hoc citation generation: Since the adapter takes the assistant response as input, it can generate citations for responses generated by any LLM. Pick your favorite LLM and use the adapter to generate post-hoc citations!

Intended use

This is a LoRA adapter that gives the ability to generate citations for the last assistant response in a multi-turn RAG conversation based on a set of provided documents/passages. It can be used to generate post-hoc citations for assistant responses generated by any LLM in a RAG setting.

Model input: The input to the model is conceptually a list of conversational turns ending with an assistant response and a list of documents converted to a string using the apply_chat_template function. For the adapter to work, the last assistant response as well as the documents should be pre-split into sentences. In more detail, the primary inputs are the following three items, each represented in JSON:

  • conversation: A list of conversational turns between the user and the assistant, where each item in the list is a dictionary with fields role and content. The role field equals either user or assistant, denoting user and assistant turns, respectively, while the content field contains the corresponding user/assistant utterance. The conversation should end with an assistant turn, and the content field of that turn should contain the assistant utterance with each sentence prefixed with a response sentence ID of the form <rI>, where I is an integer. The numbering should start from 0 (for the first sentence) and be incremented by one for each subsequent sentence in the last assistant turn. Note that only the last assistant turn should be split into sentences as described above; earlier assistant turns (as well as all user turns) should be kept in their original form.
  • instruction: A task instruction, which is encoded as a dictionary with fields role and content, where role equals system and content equals the following string describing the citation generation task: Split the last assistant response into individual sentences. For each sentence in the response, identify the statement IDs from the documents that it references. Ensure that your output includes all response sentence IDs, and for each response sentence ID, provide the corresponding referring document sentence IDs.
  • documents: A list of documents, where each item in the list is a dictionary with fields doc_id and text. The text field contains the text of the corresponding document with each sentence prefixed with a context sentence ID of the form <cI>, where I is an integer. The context sentence ID numbers should start from 0 (for the first sentence of the first document) and be incremented by one for each subsequent sentence. The numbers should continue to be incremented across documents to ensure that each context sentence ID appears once across the entire list of documents. For instance, if the last sentence of the 1st document has context sentence ID <cn>, then the first sentence of the 2nd document is expected to have ID <cn+1>.

To prompt the LoRA adapter, we combine the above components as follows: We first append the instruction to the end of the conversation to generate an augmented_conversation list. Then we invoke the apply_chat_template function with parameters: conversation = augmented_conversation and documents = documents.
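
The adapter expects the documents and the last assistant response to be pre-split into sentences and annotated with <cI>/<rI> IDs. Below is a minimal pre-processing sketch; it assumes NLTK's punkt sentence tokenizer (NLTK is what we use for sentence splitting in the evaluations further down), and the helper names are illustrative rather than part of the released code.

import nltk

nltk.download("punkt", quiet=True)

def split_documents(raw_docs):
    # Prefix every document sentence with a globally increasing <cI> ID,
    # continuing the numbering across documents.
    docs, sent_id = [], 0
    for doc in raw_docs:
        annotated = []
        for sentence in nltk.sent_tokenize(doc["text"]):
            annotated.append(f"<c{sent_id}> {sentence}")
            sent_id += 1
        docs.append({"doc_id": doc["doc_id"], "text": " ".join(annotated)})
    return docs

def split_response(response_text):
    # Prefix every sentence of the last assistant response with an <rI> ID.
    sentences = nltk.sent_tokenize(response_text)
    return " ".join(f"<r{i}> {s}" for i, s in enumerate(sentences))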

Model output: When prompted with the above input, the model generates citations for each sentence of the last assistant response in the form of a JSON dictionary. The dictionary is of the form {"<r0>": ..., "<r1>": ..., ...}, where each key <rI> (with I an integer) corresponds to the ID of a sentence in the last assistant response, and its value is a list of context sentence IDs corresponding to the sentence(s) in the input documents that support that response sentence.
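
Since the output is a JSON dictionary over sentence IDs, the cited document sentences can be recovered by mapping each context sentence ID back to its text. A minimal sketch follows; it assumes the citation values are returned as "<cI>" strings, and the helper is illustrative rather than part of the released code.

import json
import re

def citations_to_text(output_text, documents):
    # Build a lookup from context sentence IDs (e.g., "<c7>") to the sentence text.
    context = {}
    for doc in documents:
        for match in re.finditer(r"(<c\d+>)\s*([^<]*)", doc["text"]):
            context[match.group(1)] = match.group(2).strip()
    citations = json.loads(output_text)  # e.g., {"<r0>": ["<c7>", ...], ...}
    return {rid: [context[cid] for cid in cids if cid in context]
            for rid, cids in citations.items()}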

Quickstart Example

You can use the following code to get started. Note that the code assumes that the documents and the last assistant response have already been split into sentences.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import json

BASE_NAME = "ibm-granite/granite-3.2-8b-instruct"
LORA_NAME = "ibm-granite/granite-3.2-8b-lora-rag-citation-generation"
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load the base model and attach the citation-generation LoRA adapter
tokenizer = AutoTokenizer.from_pretrained(BASE_NAME, padding_side='left', trust_remote_code=True)
model_base = AutoModelForCausalLM.from_pretrained(BASE_NAME, device_map="auto")
model_citation = PeftModel.from_pretrained(model_base, LORA_NAME)

conversation = [
    {"role": "user", "content": "What is the visibility level of Git Repos and Issue Tracking projects?"}, 
    {"role": "assistant", "content": "<r0> Git Repos and Issue Tracking projects can have one of the following visibility levels: private, internal, or public. <r1> Private projects are visible only to project members, internal projects are visible to all users that are logged in to IBM Cloud, and public projects are visible to anyone. <r2> By default, new projects are set to private visibility level, which is the most secure for your data."}]

documents = [
    {"doc_id": 0, "text": "<c0> Git Repos and Issue Tracking is an IBM-hosted component of the Continuous Delivery service. <c1> All of the data that you provide to Git Repos and Issue Tracking, including but not limited to source files, issues, pull requests, and project configuration properties, is managed securely within Continuous Delivery. <c2> However, Git Repos and Issue Tracking supports various mechanisms for exporting, sending, or otherwise sharing data to users and third parties. <c3> The ability of Git Repos and Issue Tracking to share information is typical of many social coding platforms. <c4> However, such sharing might conflict with regulatory controls that apply to your business. <c5> After you create a project in Git Repos and Issue Tracking, but before you entrust any files, issues, records, or other data with the project, review the project settings and change any settings that you deem necessary to protect your data. <c6> Settings to review include visibility levels, email notifications, integrations, web hooks, access tokens, deploy tokens, and deploy keys. <c7> Project visibility levels \n\nGit Repos and Issue Tracking projects can have one of the following visibility levels: private, internal, or public. <c8> * Private projects are visible only to project members. <c9> This setting is the default visibility level for new projects, and is the most secure visibility level for your data. <c10> * Internal projects are visible to all users that are logged in to IBM Cloud. <c11> * Public projects are visible to anyone. <c12> To limit project access to only project members, complete the following steps:\n\n\n\n1. <c13> From the project sidebar, click Settings > General. <c14> 2. <c15> On the General Settings page, click Visibility > project features > permissions. <c16> 3. <c17> Locate the Project visibility setting. <c18> 4. <c19> Select Private, if it is not already selected. <c20> 5. <c21> Click Save changes. <c22> Project membership \n\nGit Repos and Issue Tracking is a cloud hosted social coding environment that is available to all Continuous Delivery users. <c23> If you are a Git Repos and Issue Tracking project Maintainer or Owner, you can invite any user and group members to the project. <c24> IBM Cloud places no restrictions on who you can invite to a project."},
    {"doc_id": 1, "text": "<c25> After you create a project in Git Repos and Issue Tracking, but before you entrust any files, issues, records, or other data with the project, review the project settings and change any settings that are necessary to protect your data. <c26> Settings to review include visibility levels, email notifications, integrations, web hooks, access tokens, deploy tokens, and deploy keys. <c27> Project visibility levels \n\nGit Repos and Issue Tracking projects can have one of the following visibility levels: private, internal, or public. <c28> * Private projects are visible only to project members. <c29> This setting is the default visibility level for new projects, and is the most secure visibility level for your data. <c30> * Internal projects are visible to all users that are logged in to IBM Cloud. <c31> * Public projects are visible to anyone. <c32> To limit project access to only project members, complete the following steps:\n\n\n\n1. <c33> From the project sidebar, click Settings > General. <c34> 2. <c35> On the General Settings page, click Visibility > project features > permissions. <c36> 3. <c37> Locate the Project visibility setting. <c38> 4. <c39> Select Private, if it is not already selected. <c40> 5. <c41> Click Save changes. <c42> Project email settings \n\nBy default, Git Repos and Issue Tracking notifies project members by way of email about project activities. <c43> These emails typically include customer-owned data that was provided to Git Repos and Issue Tracking by users. <c44> For example, if a user posts a comment to an issue, Git Repos and Issue Tracking sends an email to all subscribers. <c45> The email includes information such as a copy of the comment, the user who posted it, and when the comment was posted. <c46> To turn off all email notifications for your project, complete the following steps:\n\n\n\n1. <c47> From the project sidebar, click Settings > General. <c48> 2. <c49> On the **General Settings **page, click Visibility > project features > permissions. <c50> 3. <c51> Select the Disable email notifications checkbox. <c52> 4. <c53> Click Save changes. <c54> Project integrations and webhooks"}]

# Add system prompt
citation_sys_prompt = "Split the last assistant response into individual sentences. For each sentence in the response, identify the statement IDs from the documents that it references. Ensure that your output includes all response sentence IDs, and for each response sentence ID, provide the corresponding referring document sentence IDs."
conversation.append({"role": "system", "content": citation_sys_prompt})

# Generate citations for the last assistant response
input_text = tokenizer.apply_chat_template(conversation=conversation, documents=documents, tokenize=False)
inputs = tokenizer(input_text, return_tensors="pt")
output = model_citation.generate(inputs["input_ids"].to(device), attention_mask=inputs["attention_mask"].to(device), max_new_tokens=500)
output_text = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print("Output: ")
print(json.loads(output_text))

Training Details

The LoRA adapter was trained on synthetically-generated citation datasets. The process of generating the training data consisted of two main steps:

  • Multi-turn RAG conversation generation: Starting from publicly available document corpora, we generated a set of multi-turn RAG data, consisting of multi-turn conversations grounded on passages retrieved from the corpora. For details on the RAG conversation generation process, please refer to the Granite Technical Report and Lee, Young-Suk, et al.
  • Citation generation: For each turn of the multi-turn RAG conversations from the previous step, we used a multi-step synthetic citation generation pipeline to generate citations for the assistant response.

The resulting data instances were used to train the LoRA adapter.

Training Data

The following public datasets were used as seed datasets for the multi-turn RAG conversation generation process:

Training Hyperparameters

The LoRA adapter was fine-tuned using PEFT with rank = 8, a learning rate of 1e-5, and a 90/10 train/validation split.
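
For reference, below is a minimal PEFT configuration sketch consistent with the reported setup. Only the rank and learning rate are stated above, so the target modules, lora_alpha, and dropout shown here are illustrative assumptions, not the actual training configuration.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-3.2-8b-instruct")
lora_config = LoraConfig(
    r=8,                                  # rank reported above
    lora_alpha=16,                        # assumption, not stated in this card
    lora_dropout=0.05,                    # assumption, not stated in this card
    target_modules=["q_proj", "v_proj"],  # assumption, not stated in this card
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base, lora_config)
# Fine-tuned with a learning rate of 1e-5 on a 90/10 train/validation split.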

Evaluation

We evaluate the LoRA adapter on two citation benchmarks:

  • ALCE: Evaluates the ability of models to produce document/passage-level citations (i.e., identify the documents/passages that support a statement in the response).
  • LongBench-Cite: Evaluates the ability of models to produce fine-grained span-level citations (i.e., identify the spans within the input documents/passages that support a statement in the response) with a focus on long contexts.

Since the LoRA adapter is a post-hoc citation generation approach, its performance on the two benchmarks depends on the assistant responses for which it is asked to generate citations. To facilitate an apples-to-apples comparison, for each experiment we keep the assistant responses the same and vary only the model used to generate the citations. In particular, we prompt an LLM to create an assistant response together with citations and evaluate the generated citations on the corresponding benchmark. Then, we use the LoRA adapter to generate post-hoc citations for the same LLM responses and evaluate those citations on the same benchmark.

Evaluation on ALCE

For the ALCE evaluation, we prompt Llama-3.1-70B-Instruct and Mixtral-8x22B-Instruct to generate both the assistant response and corresponding passage-level citations. We first calculate the performance of the citations generated by these models on ALCE. Subsequently, we feed the responses of these models (leaving out the citations) to the LoRA adapter and evaluate its generated citations. The results are shown in the table below:

| Model used to generate response | Model used to generate citations | Recall | Precision | F1 |
|---|---|---|---|---|
| Llama-3.1-70B-Instruct | Llama-3.1-70B-Instruct | 61.4 | 58.1 | 59.7 |
| Llama-3.1-70B-Instruct | Granite-3.2-8B LoRA citations | 54.8 | 65.9 | 59.8 |
| Mixtral-8x22B-Instruct | Mixtral-8x22B-Instruct | 62.2 | 62.5 | 62.3 |
| Mixtral-8x22B-Instruct | Granite-3.2-8B LoRA citations | 54.3 | 69.5 | 61.0 |

We observe that the LoRA adapter performs on par with much bigger models when those are prompted to create passage-level citations. It is interesting to note that while the adapter's F1 performance is similar to the baselines, it exhibits a different precision-recall trade-off, trading lower recall for higher precision.

Notes:

  • All results are reported on the ELI5 dataset using the ORACLE (5-psg) setting.
  • To prompt Llama and Mixtral, we employ a setting similar to the one proposed in the ALCE paper; in particular, we use a two-shot prompt comprising two of the ICL examples from ALCE as well as a slightly modified version of the instruction from the paper.
  • Sentence splitting of context/response is performed using NLTK.
  • Finally, since ALCE expects passage-level citations, we elevate the finer-grained citations produced by the LoRA adapter to the passage level before running the ALCE evaluation.
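
As an illustration of the elevation step in the last note above, the sketch below maps the adapter's sentence-level citations to passage-level (doc_id-level) citations. The helper name and the assumption that citations are "<cI>" strings are illustrative; this is not the exact evaluation code.

import re

def elevate_to_passages(citations, documents):
    # Map each cited context sentence ID to the doc_id of the passage that contains it.
    sentence_to_doc = {}
    for doc in documents:
        for cid in re.findall(r"<c\d+>", doc["text"]):
            sentence_to_doc[cid] = doc["doc_id"]
    return {rid: sorted({sentence_to_doc[cid] for cid in cids if cid in sentence_to_doc})
            for rid, cids in citations.items()}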

Evaluation on LongBench-Cite

For the LongBench-Cite evaluation, we prompt Llama-3.1-70B-Instruct to generate both the assistant response and corresponding citations. Then we evaluate the citations generated by Llama as well as the post-hoc citations generated by the LoRA adapter when invoked on the Llama responses. The results are shown in the table below:

| Model used to generate response | Model used to generate citations | Longbench-Chat (en) R / P / F1 | MultifieldQA (en) R / P / F1 | HotpotQA R / P / F1 | GovReport R / P / F1 |
|---|---|---|---|---|---|
| Llama-3.1-70B-Instruct | Llama-3.1-70B-Instruct | 27.0 / 34.4 / 26.1 | 46.1 / 63.3 / 49.7 | 34.0 / 39.4 / 30.2 | 55.0 / 77.5 / 62.0 |
| Llama-3.1-70B-Instruct | Granite-3.2-8B LoRA citations | 61.9 / 68.6 / 62.0 | 71.2 / 84.1 / 74.3 | 66.8 / 73.3 / 65.4 | 70.3 / 83.6 / 75.4 |

Each benchmark reports Recall (R), Precision (P), and F1.

We observe that the LoRA adapter performs significantly better across the board than Llama-3.1-70B-Instruct when the latter is prompted to create span-level citations. This demonstrates the value of the adapter for creating post-hoc citations, even for assistant responses generated by much bigger LLMs.

Notes:

  • The evaluation results are reported on the English subset of LongBench-Cite (i.e., restricted to instances whose language field equals en).
  • The results for the LoRA adapter do not include 4 of the 585 tasks, which encountered out-of-memory errors.
  • To prompt Llama to generate a response with citations, we use the one-shot prompt described in the paper.
  • For the LoRA adapter, sentence splitting of the context is performed using NLTK. For the response, we reuse the splitting in Llama's output (since the LongBench-Cite prompt instructs the model to output a response split into sentences/statements).

Model Card Authors

Yannis Katsis
Chulaka Gunasekara
