Upload README.md for answerability

#17
Files changed (1)
  1. answerability/lora/README.md +390 -0
answerability/lora/README.md ADDED
@@ -0,0 +1,390 @@
---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
library_name: peft
---

# Intrinsics for Answerability Classification

## Model Summary
This is a RAG-specific family of intrinsics fine-tuned for the binary
answerability classification task. The model takes as input a multi-turn
conversation and a set of documents, and classifies whether the user's final
query is answerable or unanswerable based on the available information in the
documents.

We provide two variants of the intrinsic, implemented as LoRA and aLoRA
adapters, trained over Granite-3.3-2b-instruct, Granite-3.3-8b-instruct, and
GPT-OSS-20b.

- **Developer:** IBM Research
- **Model type:** LoRA and aLoRA adapters for
  [ibm-granite/granite-3.3-2b-instruct](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct),
  [ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct),
  and [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b)
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Intended use
This is a family of intrinsics that enables answerability classification for
the final user query in a multi-turn conversation, with respect to a set of
provided documents. The model is trained to determine whether the last user
query is answerable or unanswerable, based solely on the information present in
the documents. This makes it suitable for applications involving RAG and
document-grounded chatbots, where knowing whether sufficient information exists
to answer a query is crucial. The classification output from the answerability
model can be used in several downstream applications, including but not limited
to:
- Filtering out unanswerable questions before sending them to generation in a
  RAG setting. By classifying a query as unanswerable upfront, the system can
  prevent hallucinated or misleading responses (see the sketch after this
  list).
- Re-querying the retriever to get more relevant documents. If a query is
  initially deemed unanswerable, the retriever can be re-invoked with alternate
  formulations to fetch more relevant documents.
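
As a concrete illustration of the first use case, a downstream pipeline might
gate generation on the intrinsic's likelihood score (described under Model
output below). The following is a minimal sketch, assuming a hypothetical
threshold, output field name, and `score_answerability`/`generate_answer`
helpers; it is not part of the library's API:

```python
ANSWERABILITY_THRESHOLD = 0.5  # assumed operating point; tune on validation data


def answer_or_abstain(messages: list[dict], documents: list[dict]) -> str:
    """Gate RAG generation on the answerability intrinsic's score."""
    # score_answerability() is a placeholder for the intrinsic call shown in
    # the Quickstart section below; assume it returns the parsed JSON output.
    result = score_answerability(messages, documents)
    if result["answerability_likelihood"] < ANSWERABILITY_THRESHOLD:
        # Deemed unanswerable: abstain here, or re-query the retriever with an
        # alternate formulation instead of generating an answer.
        return "I cannot answer this based on the provided documents."
    return generate_answer(messages, documents)  # proceed to grounded generation
```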

**Model input**: The input to the answerability intrinsic is an
OpenAI-compatible chat completion request, containing a list of conversation
turns that can alternate between the `user` and `assistant` roles and must end
with a `user` turn, as well as a list of documents.

**Model output**: The output of the answerability intrinsic is the result of
the original chat completion request, formatted as a JSON object containing the
answerability likelihood score.

Please see the code snippets in the Quickstart Example section below for
examples that illustrate the intrinsic's input/output.
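
For illustration only, the JSON object carried in the returned completion might
look like the following. The exact field name and score scale are defined by
the intrinsic's IO configuration file, so treat this shape as an assumption
rather than the authoritative schema:

```json
{
  "answerability_likelihood": 0.87
}
```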

## Quickstart Example

To run the answerability intrinsics through granite-common, you can either (a)
use an OpenAI-compatible inference backend, such as vLLM, or (b) use the
Hugging Face transformers library. Below, we provide instructions for each of
the two approaches. Note that running inference using vLLM or another scalable
OpenAI-compatible inference backend should be significantly faster than using
the Hugging Face transformers library directly.

### Using an OpenAI-Compatible Inference Backend

To run the intrinsic using an OpenAI-compatible inference backend, such as vLLM,
follow the steps below.

1. Install the granite-common library:

   ```bash
   pip install git+https://github.com/ibm-granite/granite-common.git
   pip install "granite_common[nltk]"
   ```

2. Install the Hugging Face CLI:

   ```bash
   pip install -U "huggingface_hub[cli]"
   ```

3. Install vLLM:

   ```bash
   pip install vllm
   ```

4. Download the intrinsics library:

   ```bash
   hf download ibm-granite/rag-intrinsics-lib --local-dir ./rag-intrinsics-lib
   ```

5. Edit the vLLM startup script found in `./rag-intrinsics-lib/run_vllm.sh`
   using your favorite editor: edit the constants `BASE_MODEL_NAME` and
   `BASE_MODEL_ORG` depending on the base model on which the desired LoRA
   adapter has been trained. Optionally, edit the constant `PORT` to change the
   port on which vLLM will run. Save the modified file and exit the editor.
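
   For example, after editing, the constants at the top of the script might
   read as follows (the values are illustrative assumptions; the variable
   names are the ones defined in the script itself):

   ```bash
   # Hypothetical excerpt of run_vllm.sh after editing
   BASE_MODEL_ORG="ibm-granite"
   BASE_MODEL_NAME="granite-3.3-8b-instruct"
   PORT=55555   # should match the port used by the client code in step 7
   ```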

6. Start vLLM through the startup script. The first time you run the script,
   you may have to change the permissions to allow execution:

   ```bash
   cd rag-intrinsics-lib
   chmod u+x ./run_vllm.sh
   ./run_vllm.sh &
   ```

7. Run the following code snippet:

   ```python
   import json

   import openai

   import granite_common

   intrinsic_name = "answerability"

   # Change the following constant to select a different base model
   base_model_name = "granite-3.3-8b-instruct"

   # Change the following constants as needed to reflect the location of the
   # vLLM server. The selected port should be identical to the one you
   # specified in the vLLM startup script.
   openai_base_url = "http://localhost:55555/v1"
   openai_api_key = "rag_intrinsics_1234"

   # Fetch IO configuration file from Hugging Face Hub
   io_yaml_file = granite_common.intrinsics.util.obtain_io_yaml(
       intrinsic_name, base_model_name
   )

   # Instantiate input/output processors
   rewriter = granite_common.IntrinsicsRewriter(config_file=io_yaml_file)
   result_processor = granite_common.IntrinsicsResultProcessor(
       config_file=io_yaml_file
   )

   # Sample request
   request_json = {
       "messages": [
           {"role": "assistant", "content": "Welcome to pet questions!"},
           {"role": "user", "content": "What is the population of Australia?"},
       ],
       "extra_body": {
           "documents": [
               {"doc_id": "1", "text": "My dog has fleas."},
               {"doc_id": "2", "text": "My cat does not have fleas."},
           ]
       },
   }

   # Add other parameters
   request_json["model"] = intrinsic_name
   request_json["temperature"] = 0.0

   # Apply input processor
   intrinsic_kwargs = {}
   rewritten_request = rewriter.transform(request_json, **intrinsic_kwargs)

   # Run inference
   client = openai.OpenAI(base_url=openai_base_url, api_key=openai_api_key)
   chat_completion = client.chat.completions.create(**rewritten_request.model_dump())

   # Apply output processor
   processed_chat_completion = result_processor.transform(
       chat_completion, rewritten_request
   )

   # Verify that the contents of the completion are valid JSON and pretty-print them
   parsed_contents = json.loads(processed_chat_completion.choices[0].message.content)
   print("JSON output:")
   print(json.dumps(parsed_contents, indent=2))
   ```

### Using the Hugging Face Transformers Library

To run the intrinsic using the Hugging Face transformers library directly,
follow the steps below.

1. Install the granite-common library:

   ```bash
   pip install git+https://github.com/ibm-granite/granite-common.git
   pip install "granite_common[nltk]"
   ```

2. Install the Hugging Face CLI:

   ```bash
   pip install -U "huggingface_hub[cli]"
   ```

3. Install PEFT:

   ```bash
   pip install peft
   ```

4. Install xgrammar:

   ```bash
   pip install xgrammar
   ```

5. Run the following code snippet:

   ```python
   import json

   import granite_common.util
   import peft

   intrinsic_name = "answerability"

   # Change the following constant to select a different base model
   base_model_name = "granite-3.3-8b-instruct"

   # Set to False to use the default PyTorch device for this machine and model
   use_cuda = True

   # Fetch IO configuration file from Hugging Face Hub
   io_yaml_file = granite_common.intrinsics.util.obtain_io_yaml(
       intrinsic_name, base_model_name
   )

   # Fetch LoRA directory from Hugging Face Hub
   lora_dir = granite_common.intrinsics.util.obtain_lora(
       intrinsic_name, base_model_name
   )

   # Instantiate input/output processors
   rewriter = granite_common.IntrinsicsRewriter(config_file=io_yaml_file)
   result_processor = granite_common.IntrinsicsResultProcessor(
       config_file=io_yaml_file
   )

   # Sample request
   request_json = {
       "messages": [
           {"role": "assistant", "content": "Welcome to pet questions!"},
           {"role": "user", "content": "What is the population of Australia?"},
       ],
       "extra_body": {
           "documents": [
               {"doc_id": "1", "text": "My dog has fleas."},
               {"doc_id": "2", "text": "My cat does not have fleas."},
           ]
       },
   }

   # Add additional parameters
   request_json["model"] = intrinsic_name
   request_json["temperature"] = 0.0

   # Apply input processor
   intrinsic_kwargs = {}
   rewritten_request = rewriter.transform(request_json, **intrinsic_kwargs)

   # Load the base model and merge LoRA weights
   model, tokenizer = granite_common.util.load_transformers_lora(lora_dir)
   if use_cuda:
       model = model.cuda()

   # Convert the chat completion request into the Transformers library's
   # proprietary format
   generate_input, other_input = (
       granite_common.util.chat_completion_request_to_transformers_inputs(
           rewritten_request,
           tokenizer,
           model,
       )
   )

   # Use the Transformers library's APIs to generate one or more completions,
   # then convert those completions into OpenAI-compatible chat completion
   # responses
   responses = granite_common.util.generate_with_transformers(
       tokenizer, model, generate_input, other_input
   )

   # Apply output processor
   transformed_responses = result_processor.transform(responses, rewritten_request)

   # Verify that the contents of the completion are valid JSON and pretty-print them
   parsed_contents = json.loads(transformed_responses.choices[0].message.content)
   print("JSON output:")
   print(json.dumps(parsed_contents, indent=2))
   ```

## Training Details

### Training Data

The training data uses the publicly available Government corpus from
[MT-RAG](https://arxiv.org/pdf/2501.03468) as the source of documents. Based on
this corpus, we constructed a dataset consisting of a mix of human-created and
synthetically generated multi-turn conversations. It includes two types of
examples: (1) answerable queries, where the final user question can be answered
based on the provided documents; these examples teach the adapter to recognize
when sufficient information is present to support an answer; and (2)
unanswerable queries, where the documents lack the necessary information to
answer the final user query. We used Mixtral as an automatic judge to validate
the answerability labels and filter out noisy samples.

#### Training Hyperparameters

The LoRA adapters were fine-tuned using PEFT with the following regime: rank =
32, learning rate = 5e-6, and up to 25 epochs with early stopping based on a
validation set, using a 90/10 split between training and validation data.
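
For reference, a comparable PEFT setup might look like the following sketch.
The target modules, output directory, and Trainer wiring are assumptions for
illustration, not the exact training code:

```python
import peft
import transformers

# LoRA configuration matching the reported rank; target_modules is an assumption
lora_config = peft.LoraConfig(
    r=32,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj"],
)

base_model = transformers.AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-3.3-8b-instruct"
)
model = peft.get_peft_model(base_model, lora_config)

# Reported regime: lr = 5e-6, up to 25 epochs, early stopping on the 10% split
training_args = transformers.TrainingArguments(
    output_dir="answerability-lora",  # assumed path
    learning_rate=5e-6,
    num_train_epochs=25,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)
# A transformers.Trainer with transformers.EarlyStoppingCallback would then be
# run over the 90/10 train/validation split described above.
```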

## Evaluation

### Answerability Classification

We evaluated the model on binary answerability classification using the MT-RAG
Benchmark. In this setting, the model is given the full multi-turn conversation
history along with the supporting documents. The benchmark thus evaluates the
model's ability to assess answerability when the final user query can also
depend on prior turns for context. The following table compares baselines and
frontier models against the task-specific answerability intrinsics on the
answerability classification task over MT-RAG data. The LoRAs consistently
outperform frontier models, converging near ~90% accuracy regardless of base
model size. Even small models like Granite 3.3-2b, once fine-tuned, match or
surpass much larger models, including GPT-4o. The difference between LoRA and
aLoRA is minimal, indicating that both are effective fine-tuning strategies.

| | Models | Unanswerable F1 | Answerable F1 | Classification Accuracy | Weighted F1 |
|:---|:---|:---:|:---:|:---:|:---:|
| Baselines | BigBird (pre-trained embeddings) w/ MLP | 73.4 | 65.2 | 69.8 | 69.6 |
| | llama2-7b as classifier (Full SFT) | 88.2 | 85.9 | 87.1 | 87.1 |
| Frontier models out-of-the-box | Granite 3.3-2b-instruct | 48.7 | 70.4 | 62.4 | 58.7 |
| | Granite 3.3-8b-instruct | 62.8 | 65.2 | 64.5 | 63.9 |
| | GPT-OSS-20b | 77.3 | 58.3 | 70.7 | 68.5 |
| | GPT-OSS-120b | 70.2 | 68.9 | 69.8 | 69.6 |
| | GPT-4o-mini | 82.7 | 78.1 | 80.8 | 80.6 |
| | GPT-4o | 85.7 | 77.5 | 82.5 | 81.9 |
| Trained LoRAs/aLoRAs | Granite 3.3-2b LoRA | 91.2 | 89.6 | 90.4 | 90.5 |
| | Granite 3.3-8b LoRA | 91.1 | 90.3 | 90.6 | 90.7 |
| | GPT-OSS-20b LoRA | 91.6 | 89.8 | 90.8 | 90.8 |
| | Granite 3.3-2b aLoRA | 89.8 | 88.6 | 89.1 | 89.2 |
| | Granite 3.3-8b aLoRA | 90.1 | 89.6 | 89.5 | 89.9 |
| | GPT-OSS-20b aLoRA | 90.4 | 88.6 | 89.6 | 89.6 |

### Comparing the Answerability Intrinsics with Vanilla Granite Models on Answer Quality

We compare the vanilla Granite 3.3-2b and Granite 3.3-8b Instruct models
against the answerability intrinsics implemented as LoRA adapters on a subset
of the MT-RAG Benchmark. In this setup, each query is paired with only 5
retrieved passages as context.

- Answerability Classification Performance: The answerability intrinsics
  outperform the vanilla models in overall F1 on both answerable and
  unanswerable queries. The intrinsics achieve higher recall on unanswerable
  queries, making them better at identifying questions that should not be
  answered; however, this comes at the cost of lower recall on answerable
  queries.

- Joint Answerability-Faithfulness Score, computed per query as:

  - 1, if the model prediction is IDK/unanswerable and the ground truth is
    unanswerable;
  - the RAGAS Faithfulness score, if the model prediction is
    non-IDK/answerable and the ground truth is answerable;
  - 0, otherwise.

  This score rewards the model for correctly abstaining on unanswerable queries
  (full credit) and for providing faithful answers on answerable queries
  (partial credit based on RAGAS Faithfulness). No credit is given for
  incorrect or unfaithful predictions.
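
Expressed in code, this rule corresponds to the following minimal sketch, where
the prediction flag and the RAGAS Faithfulness value are assumed to be computed
upstream by the evaluation harness:

```python
def joint_answerability_faithfulness(
    predicted_idk: bool, ground_truth_answerable: bool, ragas_faithfulness: float
) -> float:
    """Score one query under the joint answerability-faithfulness metric."""
    if predicted_idk and not ground_truth_answerable:
        return 1.0  # full credit: correctly abstained on an unanswerable query
    if not predicted_idk and ground_truth_answerable:
        return ragas_faithfulness  # partial credit: faithfulness of the answer
    return 0.0  # no credit: wrong abstention decision
```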

The answerability intrinsics for granite-2b and granite-8b achieve 8% and 13%
lifts on this metric, respectively: the models are rewarded both for correctly
abstaining on unanswerable queries and for being faithful when they choose to
answer.

| | F1 Score Unanswerable | F1 Score Answerable | Recall Unanswerable | Recall Answerable | Joint Answerability-Faithfulness Score |
|:---|:---:|:---:|:---:|:---:|:---:|
| Granite 3.3-2b Instruct | 13 | 77 | 7 | 99 | 48 |
| Granite 3.3-2b LoRA | 48 | 78 | 37 | 89 | 56 |
| Granite 3.3-8b Instruct | 17 | 77 | 10 | 99 | 49 |
| Granite 3.3-8b LoRA | 65 | 81 | 60 | 86 | 62 |

## Model Card Authors

[Vraj Shah](mailto:[email protected])

### Framework versions

- PEFT 0.14.0