---
language:
- multilingual
- en
license: apache-2.0
pipeline_tag: text-classification
---

# LLM hallucination detector

This LLM hallucination detector, based on the hierarchical [XLM-RoBERTa-XL](https://huggingface.co/facebook/xlm-roberta-xl), was developed for participation in the [SemEval-2024 Task-6 - SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes](https://helsinki-nlp.github.io/shroom) (model-agnostic track).

## Model description

Text...

## Intended uses & limitations

This model is primarily intended for reference-based detection of hallucinations in LLM output, without any additional information about the LLM's type or architecture (i.e. in model-agnostic mode). Reference-based detection means that the hallucination detector considers not only the human question and the answer generated by the verified LLM, but also the reference answer to that question. The detector is therefore not applicable when the reference answer is unknown. In some cases, however, the references are available (for example, when analyzing an LLM's responses on an annotated test set in order to separate hallucinations from ordinary errors such as undergeneration or part-of-speech errors), and in those cases the detector can be very useful.

## Usage

You need to install [pytorch-metric-learning](https://github.com/KevinMusgrave/pytorch-metric-learning) to use this model. After that, you can use the model directly with a pipeline for text classification:

```python
from typing import Dict

from transformers import pipeline
import torch


def sample_to_str(sample: Dict[str, str]) -> str:
    """
    Convert a datapoint to an input text for an encoder-based classifier (such as RoBERTa).

    :param sample: the datapoint
    :return: the input text for the classifier (i.e. the LLM hallucination detector).
    """
    possible_tasks = {
        'PG',  # paraphrase generation
        'MT',  # machine translation
        'DM',  # definition modeling
    }
    # Normalize whitespace in the text generated by the verified LLM.
    checked_llm_prediction = ' '.join(sample['hyp'].strip().split())
    llm_task = sample['task']
    if llm_task not in possible_tasks:
        raise ValueError(f'The task {llm_task} is not supported!')
    # The context is the source text for paraphrasing and the target (reference) text otherwise.
    if llm_task == 'PG':
        context = ' '.join(sample['src'].strip().split())
        united_prompt = 'The verified system\'s task is a paraphrase generation.'
    else:
        context = ' '.join(sample['tgt'].strip().split())
        if llm_task == 'MT':
            united_prompt = 'The verified system\'s task is a machine translation.'
        else:
            united_prompt = 'The verified system\'s task is a definition modeling.'
    united_prompt += ' The sentence generated by the verified system: '
    united_prompt += checked_llm_prediction
    if united_prompt[-1].isalnum():
        united_prompt += '.'
    united_prompt += f' The generation context: {context}'
    if united_prompt[-1].isalnum():
        united_prompt += '.'
    return united_prompt


# The input data format is based on data for the model-agnostic track of SHROOM
# https://helsinki-nlp.github.io/shroom
input_data = [
    {
        "hyp": "Resembling or characteristic of a weasel.",
        "ref": "tgt",
        "src": "The writer had just entered into his eighteenth year , when he met at the table of a certain Anglo - Germanist an individual , apparently somewhat under thirty , of middle stature , a thin and weaselly figure , a sallow complexion , a certain obliquity of vision , and a large pair of spectacles .",
        "tgt": "Resembling a weasel (in appearance).",
        "model": "",
        "task": "DM",
        "labels": [
            "Hallucination",
            "Not Hallucination",
            "Not Hallucination",
            "Not Hallucination",
            "Not Hallucination"
        ],
        "label": "Not Hallucination",
        "p(Hallucination)": 0.2
    },
    {
        "hyp": "I thought you'd be surprised at me too.",
        "ref": "either",
        "src": "I thought so, too.",
        "tgt": "That was my general impression as well.",
        "model": "",
        "task": "PG",
        "labels": [
            "Hallucination",
            "Hallucination",
            "Hallucination",
            "Hallucination",
            "Hallucination"
        ],
        "label": "Hallucination",
        "p(Hallucination)": 1.0
    },
    {
        "hyp": "You can go with me perfectly.",
        "ref": "either",
        "src": "Ты вполне можешь пойти со мной.",
        "tgt": "You may as well come with me.",
        "model": "",
        "task": "MT",
        "labels": [
            "Not Hallucination",
            "Hallucination",
            "Hallucination",
            "Not Hallucination",
            "Hallucination"
        ],
        "label": "Hallucination",
        "p(Hallucination)": 0.6
    }
]

hallucination_detector = pipeline(
    task='text-classification',
    model='bond005/xlm-roberta-xl-hallucination-detector',
    framework='pt',
    trust_remote_code=True,
    device='cuda',  # requires a CUDA-capable GPU
    torch_dtype=torch.float16
)

for sample in input_data:
    input_prompt = sample_to_str(sample)
    print('')
    print('==========')
    print(f' Task: {sample["task"]}')
    print(' Question for detector:')
    print(input_prompt)
    print('==========')
    print('TRUE')
    print(f' label: {sample["label"]}')
    print(f' p(Hallucination): {round(sample["p(Hallucination)"], 3)}')
    prediction = hallucination_detector(input_prompt)[0]
    predicted_label = prediction['label']
    # The pipeline returns the score of the predicted label only,
    # so the score is inverted when the predicted label is "Not Hallucination".
    if predicted_label == 'Hallucination':
        hallucination_probability = prediction['score']
    else:
        hallucination_probability = 1.0 - prediction['score']
    print('PREDICTED')
    print(f' label: {predicted_label}')
    print(f' p(Hallucination): {round(hallucination_probability, 3)}')
```

```text
==========
 Task: DM
 Question for detector:
The verified system's task is a definition modeling. The sentence generated by the verified system: Resembling or characteristic of a weasel. The generation context: Resembling a weasel (in appearance).
==========
TRUE
 label: Not Hallucination
 p(Hallucination): 0.2
PREDICTED
 label: Not Hallucination
 p(Hallucination): 0.297

==========
 Task: PG
 Question for detector:
The verified system's task is a paraphrase generation. The sentence generated by the verified system: I thought you'd be surprised at me too. The generation context: I thought so, too.
==========
TRUE
 label: Hallucination
 p(Hallucination): 1.0
PREDICTED
 label: Hallucination
 p(Hallucination): 0.563

==========
 Task: MT
 Question for detector:
The verified system's task is a machine translation. The sentence generated by the verified system: You can go with me perfectly. The generation context: You may as well come with me.
==========
TRUE
 label: Hallucination
 p(Hallucination): 0.6
PREDICTED
 label: Not Hallucination
 p(Hallucination): 0.487
```

A Google Colaboratory version of [this script](https://colab.research.google.com/drive/1T5LOuYfLNI3bqz6W-Y6kEajk3SumxyqU?usp=sharing) is also available.
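
In the three samples shown in the usage script, the `p(Hallucination)` field equals the share of `"Hallucination"` votes in `labels`, and `label` matches the majority vote. The following minimal sketch illustrates that relationship, together with the conversion of a pipeline output item into a hallucination probability as done in the loop of the script. The function names and the `0.5` threshold are illustrative assumptions and are not part of the model or of the SHROOM tooling.

```python
from typing import Dict, List, Union


def votes_to_probability(labels: List[str]) -> float:
    """Return the share of annotators who voted "Hallucination"."""
    if not labels:
        raise ValueError('The list of annotator labels is empty!')
    return sum(1 for cur in labels if cur == 'Hallucination') / len(labels)


def prediction_to_probability(prediction: Dict[str, Union[str, float]]) -> float:
    """Convert one pipeline output item into p(Hallucination).

    The item holds the predicted label and the score of that label,
    so the score is inverted for the "Not Hallucination" label.
    """
    score = float(prediction['score'])
    if prediction['label'] == 'Hallucination':
        return score
    return 1.0 - score


def probability_to_label(probability: float, threshold: float = 0.5) -> str:
    """Map a probability to a label (the 0.5 threshold is an assumption)."""
    if probability > threshold:
        return 'Hallucination'
    return 'Not Hallucination'


# Annotator votes of the machine translation sample from the usage script.
votes = [
    'Not Hallucination', 'Hallucination', 'Hallucination',
    'Not Hallucination', 'Hallucination'
]
true_probability = votes_to_probability(votes)
print(round(true_probability, 3), probability_to_label(true_probability))

# A hypothetical pipeline output item with the same structure as in the script.
predicted_probability = prediction_to_probability(
    {'label': 'Not Hallucination', 'score': 0.513}
)
print(round(predicted_probability, 3), probability_to_label(predicted_probability))
```

Comparing the two probabilities, in addition to the two labels, shows how close the detector is to the annotators on borderline samples such as this one.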