Model Card for Llama-3-8B-CR-DPO
Model Description
This model card describes Llama-3-8B-CR-DPO, a LoRA fine-tune of Llama-3-8B-Instruct trained with Confidence Reasoning Direct Preference Optimization (CR-DPO), as introduced in the paper Enhancing Large Language Models' Situated Faithfulness to External Contexts. CR-DPO calibrates an LLM's trust in external contexts by aligning its confidence in internal knowledge with the reliability of the external information.
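A minimal usage sketch is given below, assuming the weights are released as a LoRA adapter that can be attached to Llama-3-8B-Instruct with `peft`; the adapter repository id and the prompt format are illustrative placeholders, not the published ones.

```python
# Illustrative loading example; "your-org/Llama-3-8B-CR-DPO" is a placeholder repo id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Meta-Llama-3-8B-Instruct"
adapter_id = "your-org/Llama-3-8B-CR-DPO"  # hypothetical adapter repository

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)  # attach the CR-DPO LoRA adapter

# Placeholder prompt: provide the question together with the external context.
messages = [{
    "role": "user",
    "content": "Question: ...\nContext: ...\nReason about your confidence, then answer.",
}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```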
Method
The model learns verbalized confidence reasoning by optimizing a preference between pairs of self-sampled reasoning paths. When the model's internal answer is incorrect, it is shown a correct external context and asked to reason about why the context is right and its own answer is wrong, forming the preferred reasoning path; misleading it with an incorrect external context instead yields the rejected path. The same process is applied when the model's internal answer is correct, comparing internal and external reasoning. To enhance reasoning diversity, dual sampling is used to generate two reasoning-path pairs with varied prompts and in-context examples. Additionally, a negative log-likelihood term is incorporated into the DPO loss to further optimize the reasoning (a sketch of this objective is given below). The training data is constructed from TriviaQA, NaturalQA, PopQA, and RedditQA.
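As a concrete reference for the objective above, the following is a minimal sketch of a DPO loss augmented with a negative log-likelihood term on the preferred reasoning path. The function name, the hyperparameter values (`beta`, `nll_weight`), and the use of sequence-level log-probabilities are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of a DPO loss with an added NLL term on the preferred (chosen) reasoning path.
import torch.nn.functional as F

def cr_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                reference_chosen_logps, reference_rejected_logps,
                beta=0.1, nll_weight=1.0):
    """All inputs are sequence-level log-probabilities summed over response tokens."""
    # Standard DPO term: prefer the chosen reasoning path over the rejected one,
    # measured relative to a frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps)
    preference_loss = -F.logsigmoid(chosen_rewards - rejected_rewards)

    # Extra negative log-likelihood term on the chosen path, as described above.
    nll_loss = -policy_chosen_logps

    return (preference_loss + nll_weight * nll_loss).mean()
```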
Other Info
- Finetuned from model: Llama-3-8B-Instruct
- Repository: GitHub Link
- Paper: Enhancing Large Language Models' Situated Faithfulness to External Contexts
BibTeX:
@article{Huang2024EnhancingLL,
  title={Enhancing Large Language Models' Situated Faithfulness to External Contexts},
  author={Yukun Huang and Sanxing Chen and Hongyi Cai and Bhuwan Dhingra},
  journal={ArXiv},
  year={2024},
  volume={abs/2410.14675},
  url={https://api.semanticscholar.org/CorpusID:273482717}
}