Upload README.md for answerability

#17
Files changed (1)
  1. answerability/lora/README.md +390 -0
answerability/lora/README.md ADDED
@@ -0,0 +1,390 @@
---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
library_name: peft
---

# Intrinsics for Answerability Classification

## Model Summary
This is a RAG-specific family of intrinsics fine-tuned for the binary
answerability classification task. The model takes as input a multi-turn
conversation and a set of documents, and classifies whether the user's final
query is answerable or unanswerable based on the available information in the
documents.

We provide two variants of the intrinsic, implemented as LoRA and aLoRA
adapters, trained over Granite-3.3-2b-instruct, Granite-3.3-8b-instruct, and
GPT-OSS-20b.

- **Developer:** IBM Research
- **Model type:** LoRA and aLoRA adapters for
  [ibm-granite/granite-3.3-2b-instruct](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct),
  [ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct),
  and [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b)
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Intended use
This is a family of intrinsics that enables answerability classification for
the final user query in a multi-turn conversation, with respect to a set of
provided documents. The model is trained to determine whether the last user
query is answerable or unanswerable, based solely on the information present in
the documents. This makes it suitable for applications involving RAG and
document-grounded chatbots, where knowing whether sufficient information exists
to answer a query is crucial. The classification output from the answerability
model can be used in several downstream applications, including but not limited
to:
- Filtering out unanswerable questions before sending them to generation in a
  RAG setting. By classifying a query as unanswerable upfront, the system can
  prevent hallucinated or misleading responses (see the sketch after this
  list).
- Re-querying the retriever to get more relevant documents. If a query is
  initially deemed unanswerable, the retriever can be re-invoked with alternate
  formulations to fetch more relevant documents.
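
As a concrete illustration of the first use case, a downstream pipeline might
gate generation on the intrinsic's likelihood score (described under Model
output below). The following is a minimal sketch, assuming a hypothetical
threshold, output field name, and `score_answerability`/`generate_answer`
helpers; it is not part of the library's API:

```python
ANSWERABILITY_THRESHOLD = 0.5  # assumed operating point; tune on validation data


def answer_or_abstain(messages: list[dict], documents: list[dict]) -> str:
    """Gate RAG generation on the answerability intrinsic's score."""
    # score_answerability() is a placeholder for the intrinsic call shown in
    # the Quickstart section below; assume it returns the parsed JSON output.
    result = score_answerability(messages, documents)
    if result["answerability_likelihood"] < ANSWERABILITY_THRESHOLD:
        # Deemed unanswerable: abstain here, or re-query the retriever with an
        # alternate formulation instead of generating an answer.
        return "I cannot answer this based on the provided documents."
    return generate_answer(messages, documents)  # proceed to grounded generation
```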

**Model input**: The input to the answerability intrinsic is an
OpenAI-compatible chat completion request, containing a list of conversation
turns that can alternate between the `user` and `assistant` roles and must end
with a `user` turn, as well as a list of documents.

**Model output**: The output of the answerability intrinsic is the result of
the original chat completion request, formatted as a JSON object containing the
answerability likelihood score.

Please see the code snippets in the Quickstart Example section below for
examples that illustrate the intrinsic's input/output.
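
For illustration only, the JSON object carried in the returned completion might
look like the following. The exact field name and score scale are defined by
the intrinsic's IO configuration file, so treat this shape as an assumption
rather than the authoritative schema:

```json
{
  "answerability_likelihood": 0.87
}
```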

## Quickstart Example

To run the answerability intrinsics through granite-common, you can either (a)
use an OpenAI-compatible inference backend, such as vLLM, or (b) use the
Hugging Face transformers library. Below, we provide instructions for each of
the two approaches. Note that running inference using vLLM or another scalable
OpenAI-compatible inference backend should be significantly faster than using
the Hugging Face transformers library directly.

### Using an OpenAI-Compatible Inference Backend

To run the intrinsic using an OpenAI-compatible inference backend, such as vLLM,
follow the steps below.

1. Install the granite-common library:

   ```bash
   pip install git+https://github.com/ibm-granite/granite-common.git
   pip install "granite_common[nltk]"
   ```

2. Install the Hugging Face CLI:

   ```bash
   pip install -U "huggingface_hub[cli]"
   ```

3. Install vLLM:

   ```bash
   pip install vllm
   ```

4. Download the intrinsics library:

   ```bash
   hf download ibm-granite/rag-intrinsics-lib --local-dir ./rag-intrinsics-lib
   ```

5. Edit the vLLM startup script found in `./rag-intrinsics-lib/run_vllm.sh`
   using your favorite editor: edit the constants `BASE_MODEL_NAME` and
   `BASE_MODEL_ORG` depending on the base model on which the desired LoRA
   adapter has been trained. Optionally, edit the constant `PORT` to change the
   port on which vLLM will run. Save the modified file and exit the editor.
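
   For example, after editing, the constants at the top of the script might
   read as follows (the values are illustrative assumptions; the variable
   names are the ones defined in the script itself):

   ```bash
   # Hypothetical excerpt of run_vllm.sh after editing
   BASE_MODEL_ORG="ibm-granite"
   BASE_MODEL_NAME="granite-3.3-8b-instruct"
   PORT=55555   # should match the port used by the client code in step 7
   ```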

6. Start vLLM through the startup script. The first time you run the script,
   you may have to change the permissions to allow execution:

   ```bash
   cd rag-intrinsics-lib
   chmod u+x ./run_vllm.sh
   ./run_vllm.sh &
   ```

7. Run the following code snippet:

   ```python
   import json

   import openai

   import granite_common

   intrinsic_name = "answerability"

   # Change the following constant to select a different base model
   base_model_name = "granite-3.3-8b-instruct"

   # Change the following constants as needed to reflect the location of the
   # vLLM server. The selected port should be identical to the one you
   # specified in the vLLM startup script.
   openai_base_url = "http://localhost:55555/v1"
   openai_api_key = "rag_intrinsics_1234"

   # Fetch IO configuration file from Hugging Face Hub
   io_yaml_file = granite_common.intrinsics.util.obtain_io_yaml(
       intrinsic_name, base_model_name
   )

   # Instantiate input/output processors
   rewriter = granite_common.IntrinsicsRewriter(config_file=io_yaml_file)
   result_processor = granite_common.IntrinsicsResultProcessor(
       config_file=io_yaml_file
   )

   # Sample request
   request_json = {
       "messages": [
           {"role": "assistant", "content": "Welcome to pet questions!"},
           {"role": "user", "content": "What is the population of Australia?"},
       ],
       "extra_body": {
           "documents": [
               {"doc_id": "1", "text": "My dog has fleas."},
               {"doc_id": "2", "text": "My cat does not have fleas."},
           ]
       },
   }

   # Add other parameters
   request_json["model"] = intrinsic_name
   request_json["temperature"] = 0.0

   # Apply input processor
   intrinsic_kwargs = {}
   rewritten_request = rewriter.transform(request_json, **intrinsic_kwargs)

   # Run inference
   client = openai.OpenAI(base_url=openai_base_url, api_key=openai_api_key)
   chat_completion = client.chat.completions.create(**rewritten_request.model_dump())

   # Apply output processor
   processed_chat_completion = result_processor.transform(
       chat_completion, rewritten_request
   )

   # Verify that the contents of the completion are valid JSON and pretty-print them
   parsed_contents = json.loads(processed_chat_completion.choices[0].message.content)
   print("JSON output:")
   print(json.dumps(parsed_contents, indent=2))
   ```

### Using the Hugging Face Transformers Library

To run the intrinsic using the Hugging Face transformers library directly,
follow the steps below.

1. Install the granite-common library:

   ```bash
   pip install git+https://github.com/ibm-granite/granite-common.git
   pip install "granite_common[nltk]"
   ```

2. Install the Hugging Face CLI:

   ```bash
   pip install -U "huggingface_hub[cli]"
   ```

3. Install PEFT:

   ```bash
   pip install peft
   ```

4. Install xgrammar:

   ```bash
   pip install xgrammar
   ```

5. Run the following code snippet:

   ```python
   import json

   import granite_common.util
   import peft

   intrinsic_name = "answerability"

   # Change the following constant to select a different base model
   base_model_name = "granite-3.3-8b-instruct"

   # Set to False to use the default PyTorch device for this machine and model
   use_cuda = True

   # Fetch IO configuration file from Hugging Face Hub
   io_yaml_file = granite_common.intrinsics.util.obtain_io_yaml(
       intrinsic_name, base_model_name
   )

   # Fetch LoRA directory from Hugging Face Hub
   lora_dir = granite_common.intrinsics.util.obtain_lora(
       intrinsic_name, base_model_name
   )

   # Instantiate input/output processors
   rewriter = granite_common.IntrinsicsRewriter(config_file=io_yaml_file)
   result_processor = granite_common.IntrinsicsResultProcessor(
       config_file=io_yaml_file
   )

   # Sample request
   request_json = {
       "messages": [
           {"role": "assistant", "content": "Welcome to pet questions!"},
           {"role": "user", "content": "What is the population of Australia?"},
       ],
       "extra_body": {
           "documents": [
               {"doc_id": "1", "text": "My dog has fleas."},
               {"doc_id": "2", "text": "My cat does not have fleas."},
           ]
       },
   }

   # Add additional parameters
   request_json["model"] = intrinsic_name
   request_json["temperature"] = 0.0

   # Apply input processor
   intrinsic_kwargs = {}
   rewritten_request = rewriter.transform(request_json, **intrinsic_kwargs)

   # Load the base model and merge LoRA weights
   model, tokenizer = granite_common.util.load_transformers_lora(lora_dir)
   if use_cuda:
       model = model.cuda()

   # Convert the chat completion request into the Transformers library's
   # proprietary format
   generate_input, other_input = (
       granite_common.util.chat_completion_request_to_transformers_inputs(
           rewritten_request,
           tokenizer,
           model,
       )
   )

   # Use the Transformers library's APIs to generate one or more completions,
   # then convert those completions into OpenAI-compatible chat completion
   # responses
   responses = granite_common.util.generate_with_transformers(
       tokenizer, model, generate_input, other_input
   )

   # Apply output processor
   transformed_responses = result_processor.transform(responses, rewritten_request)

   # Verify that the contents of the completion are valid JSON and pretty-print them
   parsed_contents = json.loads(transformed_responses.choices[0].message.content)
   print("JSON output:")
   print(json.dumps(parsed_contents, indent=2))
   ```

## Training Details

### Training Data

The training data uses the publicly available Government corpus from
[MT-RAG](https://arxiv.org/pdf/2501.03468) as the source of documents. Based on
this corpus, we constructed a dataset consisting of a mix of human-created and
synthetically generated multi-turn conversations. It includes two types of
examples: (1) answerable queries, where the final user question can be answered
based on the provided documents; these examples teach the adapter to recognize
when sufficient information is present to support an answer; and (2)
unanswerable queries, where the documents lack the necessary information to
answer the final user query. We used Mixtral as an automatic judge to validate
the answerability labels and filter out noisy samples.

#### Training Hyperparameters

The LoRA adapters were fine-tuned using PEFT with the following regime: rank =
32, learning rate = 5e-6, and up to 25 epochs with early stopping based on a
validation set, using a 90/10 split between training and validation data.
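
For reference, a comparable PEFT setup might look like the following sketch.
The target modules, output directory, and Trainer wiring are assumptions for
illustration, not the exact training code:

```python
import peft
import transformers

# LoRA configuration matching the reported rank; target_modules is an assumption
lora_config = peft.LoraConfig(
    r=32,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj"],
)

base_model = transformers.AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-3.3-8b-instruct"
)
model = peft.get_peft_model(base_model, lora_config)

# Reported regime: lr = 5e-6, up to 25 epochs, early stopping on the 10% split
training_args = transformers.TrainingArguments(
    output_dir="answerability-lora",  # assumed path
    learning_rate=5e-6,
    num_train_epochs=25,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)
# A transformers.Trainer with transformers.EarlyStoppingCallback would then be
# run over the 90/10 train/validation split described above.
```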

## Evaluation

### Answerability Classification

We evaluated the model on binary answerability classification using the MT-RAG
Benchmark. In this setting, the model is given the full multi-turn conversation
history along with the supporting documents. The benchmark thus evaluates the
model's ability to assess answerability when the final user query can also
depend on prior turns for context. The following table compares baselines and
frontier models against the task-specific answerability intrinsics on the
answerability classification task over MT-RAG data. The LoRAs consistently
outperform frontier models, converging near ~90% accuracy regardless of base
model size. Even small models like Granite 3.3-2b, once fine-tuned, match or
surpass much larger models, including GPT-4o. The difference between LoRA and
aLoRA is minimal, indicating that both are effective fine-tuning strategies.

| | Models | Unanswerable F1 | Answerable F1 | Classification Accuracy | Weighted F1 |
|:---|:---|:---:|:---:|:---:|:---:|
| Baselines | BigBird (pre-trained embeddings) w/ MLP | 73.4 | 65.2 | 69.8 | 69.6 |
| | llama2-7b as classifier (Full SFT) | 88.2 | 85.9 | 87.1 | 87.1 |
| Frontier models out-of-the-box | Granite 3.3-2b-instruct | 48.7 | 70.4 | 62.4 | 58.7 |
| | Granite 3.3-8b-instruct | 62.8 | 65.2 | 64.5 | 63.9 |
| | GPT-OSS-20b | 77.3 | 58.3 | 70.7 | 68.5 |
| | GPT-OSS-120b | 70.2 | 68.9 | 69.8 | 69.6 |
| | GPT-4o-mini | 82.7 | 78.1 | 80.8 | 80.6 |
| | GPT-4o | 85.7 | 77.5 | 82.5 | 81.9 |
| Trained LoRAs/aLoRAs | Granite 3.3-2b LoRA | 91.2 | 89.6 | 90.4 | 90.5 |
| | Granite 3.3-8b LoRA | 91.1 | 90.3 | 90.6 | 90.7 |
| | GPT-OSS-20b LoRA | 91.6 | 89.8 | 90.8 | 90.8 |
| | Granite 3.3-2b aLoRA | 89.8 | 88.6 | 89.1 | 89.2 |
| | Granite 3.3-8b aLoRA | 90.1 | 89.6 | 89.5 | 89.9 |
| | GPT-OSS-20b aLoRA | 90.4 | 88.6 | 89.6 | 89.6 |

### Comparing the Answerability Intrinsics with Vanilla Granite Models on Answer Quality

We compare the vanilla Granite 3.3-2b and Granite 3.3-8b Instruct models
against the answerability intrinsics implemented as LoRA adapters on a subset
of the MT-RAG Benchmark. In this setup, each query is paired with only 5
retrieved passages as context.

- Answerability Classification Performance: The answerability intrinsics
  outperform the vanilla models in overall F1 on both answerable and
  unanswerable queries. The intrinsics achieve higher recall on unanswerable
  queries, making them better at identifying questions that should not be
  answered; however, this comes at the cost of lower recall on answerable
  queries.

- Joint Answerability-Faithfulness Score, computed per query as:

  - 1, if the model prediction is IDK/unanswerable and the ground truth is
    unanswerable;
  - the RAGAS Faithfulness score, if the model prediction is
    non-IDK/answerable and the ground truth is answerable;
  - 0, otherwise.

  This score rewards the model for correctly abstaining on unanswerable queries
  (full credit) and for providing faithful answers on answerable queries
  (partial credit based on RAGAS Faithfulness). No credit is given for
  incorrect or unfaithful predictions.
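
Expressed in code, this rule corresponds to the following minimal sketch, where
the prediction flag and the RAGAS Faithfulness value are assumed to be computed
upstream by the evaluation harness:

```python
def joint_answerability_faithfulness(
    predicted_idk: bool, ground_truth_answerable: bool, ragas_faithfulness: float
) -> float:
    """Score one query under the joint answerability-faithfulness metric."""
    if predicted_idk and not ground_truth_answerable:
        return 1.0  # full credit: correctly abstained on an unanswerable query
    if not predicted_idk and ground_truth_answerable:
        return ragas_faithfulness  # partial credit: faithfulness of the answer
    return 0.0  # no credit: wrong abstention decision
```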

The answerability intrinsics for granite-2b and granite-8b achieve 8% and 13%
lifts on this metric, respectively: the models are rewarded both for correctly
abstaining on unanswerable queries and for being faithful when they choose to
answer.

| | F1 Score Unanswerable | F1 Score Answerable | Recall Unanswerable | Recall Answerable | Joint Answerability-Faithfulness Score |
|:---|:---:|:---:|:---:|:---:|:---:|
| Granite 3.3-2b Instruct | 13 | 77 | 7 | 99 | 48 |
| Granite 3.3-2b LoRA | 48 | 78 | 37 | 89 | 56 |
| Granite 3.3-8b Instruct | 17 | 77 | 10 | 99 | 49 |
| Granite 3.3-8b LoRA | 65 | 81 | 60 | 86 | 62 |

## Model Card Authors

[Vraj Shah](mailto:[email protected])

### Framework versions

- PEFT 0.14.0