|
--- |
|
language: |
|
- vi |
|
library_name: transformers |
|
license: mit |
|
pipeline_tag: question-answering |
|
tags: |
|
- SemViQA |
|
- question-answering |
|
- fact-checking |
|
- information-retrieval |
|
--- |
|
|
|
# SemViQA-QATC: Vietnamese Question Answering Token Classifier |
|
|
|
## Model Description |
|
|
|
**SemViQA-QATC** is a component of the **SemViQA** system, fine-tuned from Vi-MRC to perform **Extractive Question Answering (QA)** and **evidence extraction** for fact-checking in Vietnamese. |
|
|
|
### **Model Information** |
|
- **Developed by:** [SemViQA Research Team](https://huggingface.co/SemViQA) |
|
- **Fine-tuned from:** [Vi-MRC](https://huggingface.co/nguyenvulebinh/vi-mrc-large)
|
- **Supported Language:** Vietnamese |
|
- **Task:** Extractive QA, Evidence Extraction |
|
- **Dataset:** [ISE-DSC01](https://codalab.lisn.upsaclay.fr/competitions/15497) |
|
|
|
QATCForQuestionAnswering builds on XLM-RoBERTa as its pre-trained language model. We extend it with a token-classification head, so the model not only predicts answer spans but also classifies individual tokens as part of rationale selection. During training, we add a Rationale Regularization Loss with sparsity and continuity constraints, which encourages precise, interpretable token-level predictions: the model learns to identify the relevant rationale tokens while keeping the selected tokens coherent.
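As a rough illustration of how sparsity and continuity constraints can be combined, the PyTorch sketch below shows one plausible form of such a regularizer. The function name, weighting coefficients, and exact formulation are assumptions made for illustration and are not taken from the SemViQA training code.

```python
import torch

# Illustrative sketch only (assumed form, not the exact SemViQA implementation):
# a rationale regularizer with a sparsity term (select few tokens) and a
# continuity term (prefer contiguous spans), added to the usual span loss.
def rationale_regularization(token_probs: torch.Tensor,
                             lambda_sparsity: float = 0.1,
                             lambda_continuity: float = 0.1) -> torch.Tensor:
    # token_probs: (batch, seq_len) probability that each token is part of the rationale
    sparsity = token_probs.mean()  # pushes most token probabilities toward 0
    continuity = (token_probs[:, 1:] - token_probs[:, :-1]).abs().mean()  # penalizes on/off flicker
    return lambda_sparsity * sparsity + lambda_continuity * continuity

# Example with stand-in token-classification logits
token_logits = torch.randn(2, 512)
reg_loss = rationale_regularization(torch.sigmoid(token_logits))
# total_loss = span_loss + reg_loss  # combined with the extractive QA objective
```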
|
|
|
## Usage Example |
|
|
|
### Direct Model Usage
|
```python |
|
# Install the package first: pip install semviqa
|
|
|
# Initialize the tokenizer and model
|
from transformers import AutoTokenizer |
|
from semviqa.ser.qatc_model import QATCForQuestionAnswering |
|
import torch |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("SemViQA/qatc-vimrc-isedsc01") |
|
model = QATCForQuestionAnswering.from_pretrained("SemViQA/qatc-vimrc-isedsc01") |
|
claim = "Chiแบฟn tranh vแปi Campuchia ฤรฃ kแบฟt thรบc trฦฐแปc khi Viแปt Nam thแปng nhแบฅt." |
|
context = "Sau khi thแปng nhแบฅt, Viแปt Nam tiแบฟp tแปฅc gแบทp khรณ khฤn do sแปฑ sแปฅp ฤแป vร tan rรฃ cแปงa ฤแปng minh Liรชn Xรด cรนng Khแปi phรญa ฤรดng, cรกc lแปnh cแบฅm vแบญn cแปงa Hoa Kแปณ, chiแบฟn tranh vแปi Campuchia, biรชn giแปi giรกp Trung Quแปc vร hแบญu quแบฃ cแปงa chรญnh sรกch bao cแบฅp sau nhiแปu nฤm รกp dแปฅng. Nฤm 1986, ฤแบฃng Cแปng sแบฃn ban hร nh cแบฃi cรกch ฤแปi mแปi, tแบกo ฤiแปu kiแปn hรฌnh thร nh kinh tแบฟ thแป trฦฐแปng vร hแปi nhแบญp sรขu rแปng. Cแบฃi cรกch ฤแปi mแปi kแบฟt hแปฃp cรนng quy mรด dรขn sแป lแปn ฤฦฐa Viแปt Nam trแป thร nh mแปt trong nhแปฏng nฦฐแปc ฤang phรกt triแปn cรณ tแปc ฤแป tฤng trฦฐแปng thuแปc nhรณm nhanh nhแบฅt thแบฟ giแปi, ฤฦฐแปฃc coi lร Hแป mแปi chรขu ร dรน cho vแบซn gแบทp phแบฃi nhแปฏng thรกch thแปฉc nhฦฐ tham nhลฉng, tแปi phแบกm gia tฤng, รด nhiแป
m mรดi trฦฐแปng vร phรบc lแปฃi xรฃ hแปi chฦฐa ฤแบงy ฤแปง. Ngoร i ra, giแปi bแบฅt ฤแปng chรญnh kiแบฟn, chรญnh phแปง mแปt sแป nฦฐแปc phฦฐฦกng Tรขy vร cรกc tแป chแปฉc theo dรตi nhรขn quyแปn cรณ quan ฤiแปm chแป trรญch hแป sฦก nhรขn quyแปn cแปงa Viแปt Nam liรชn quan ฤแบฟn cรกc vแบฅn ฤแป tรดn giรกo, kiแปm duyแปt truyแปn thรดng, hแบกn chแบฟ hoแบกt ฤแปng แปงng hแป nhรขn quyแปn cรนng cรกc quyแปn tแปฑ do dรขn sแปฑ." |
|
|
|
inputs = tokenizer(claim, context, return_tensors="pt", truncation=True, max_length=512) |
|
|
|
with torch.no_grad(): |
|
outputs = model(**inputs) |
|
|
|
start_logits = outputs.start_logits |
|
end_logits = outputs.end_logits |
|
|
|
start_idx = torch.argmax(start_logits) |
|
end_idx = torch.argmax(end_logits) |
|
|
|
tokens = inputs["input_ids"][0][start_idx : end_idx + 1] |
|
evidence = tokenizer.decode(tokens, skip_special_tokens=True) |
|
print(evidence) |
|
# evidence: Sau khi thống nhất, Việt Nam tiếp tục gặp khó khăn do sự sụp đổ và tan rã của đồng minh Liên Xô cùng Khối phía Đông, các lệnh cấm vận của Hoa Kỳ, chiến tranh với Campuchia, biên giới giáp Trung Quốc và hậu quả của chính sách bao cấp sau nhiều năm áp dụng.
|
``` |
|
|
|
### Combining TF-IDF and QATC with a Confidence Threshold

`extract_evidence_tfidf_qatc` combines TF-IDF sentence retrieval with the QATC model; `confidence_threshold` and `length_ratio_threshold` control when the faster TF-IDF result is kept and when the model is consulted instead (see the SemViQA paper for the exact heuristic).
|
```python |
|
import torch |
|
from transformers import AutoTokenizer |
|
from semviqa.ser.qatc_model import QATCForQuestionAnswering |
|
from semviqa.ser.ser_eval import extract_evidence_tfidf_qatc |
|
|
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("SemViQA/qatc-vimrc-isedsc01") |
|
model = QATCForQuestionAnswering.from_pretrained("SemViQA/qatc-vimrc-isedsc01").to(device)
|
|
|
claim = "Chiแบฟn tranh vแปi Campuchia ฤรฃ kแบฟt thรบc trฦฐแปc khi Viแปt Nam thแปng nhแบฅt." |
|
context = "Sau khi thแปng nhแบฅt, Viแปt Nam tiแบฟp tแปฅc gแบทp khรณ khฤn do sแปฑ sแปฅp ฤแป vร tan rรฃ cแปงa ฤแปng minh Liรชn Xรด cรนng Khแปi phรญa ฤรดng, cรกc lแปnh cแบฅm vแบญn cแปงa Hoa Kแปณ, chiแบฟn tranh vแปi Campuchia, biรชn giแปi giรกp Trung Quแปc vร hแบญu quแบฃ cแปงa chรญnh sรกch bao cแบฅp sau nhiแปu nฤm รกp dแปฅng. Nฤm 1986, ฤแบฃng Cแปng sแบฃn ban hร nh cแบฃi cรกch ฤแปi mแปi, tแบกo ฤiแปu kiแปn hรฌnh thร nh kinh tแบฟ thแป trฦฐแปng vร hแปi nhแบญp sรขu rแปng. Cแบฃi cรกch ฤแปi mแปi kแบฟt hแปฃp cรนng quy mรด dรขn sแป lแปn ฤฦฐa Viแปt Nam trแป thร nh mแปt trong nhแปฏng nฦฐแปc ฤang phรกt triแปn cรณ tแปc ฤแป tฤng trฦฐแปng thuแปc nhรณm nhanh nhแบฅt thแบฟ giแปi, ฤฦฐแปฃc coi lร Hแป mแปi chรขu ร dรน cho vแบซn gแบทp phแบฃi nhแปฏng thรกch thแปฉc nhฦฐ tham nhลฉng, tแปi phแบกm gia tฤng, รด nhiแป
m mรดi trฦฐแปng vร phรบc lแปฃi xรฃ hแปi chฦฐa ฤแบงy ฤแปง. Ngoร i ra, giแปi bแบฅt ฤแปng chรญnh kiแบฟn, chรญnh phแปง mแปt sแป nฦฐแปc phฦฐฦกng Tรขy vร cรกc tแป chแปฉc theo dรตi nhรขn quyแปn cรณ quan ฤiแปm chแป trรญch hแป sฦก nhรขn quyแปn cแปงa Viแปt Nam liรชn้ขใใๅ้กๅฎๆๆค้ฒๅ ฑ้ไบบ๊ถ๋ๆจฉใจ่ช็ฑ๋ฏผไบ." |
|
|
|
evidence = extract_evidence_tfidf_qatc( |
|
claim, context, model, tokenizer, device, confidence_threshold=0.5, length_ratio_threshold=0.6 |
|
) |
|
|
|
print(evidence) |
|
# evidence: sau khi thống nhất việt nam tiếp tục gặp khó khăn do sự sụp đổ và tan rã của đồng minh liên xô cùng khối phía đông các lệnh cấm vận của hoa kỳ chiến tranh với campuchia biên giới giáp trung quốc và hậu quả của chính sách bao cấp sau nhiều năm áp dụng
|
``` |
|
|
|
## **Evaluation Results** |
|
|
|
<table> |
|
<thead> |
|
<tr> |
|
<th colspan="2">Method</th> |
|
<th colspan="4">ISE-DSC01</th> |
|
</tr> |
|
<tr> |
|
<th>ER (Evidence Retrieval)</th>

<th>VC (Verdict Classification)</th>
|
<th>Strict Acc</th> |
|
<th>VC Acc</th> |
|
<th>ER Acc</th> |
|
<th>Time (s)</th> |
|
</tr> |
|
</thead> |
|
<tbody> |
|
<tr> |
|
<td rowspan="3">TF-IDF</td> |
|
<td>InfoXLM<sub>large</sub></td> |
|
<td>73.59</td> |
|
<td>78.08</td> |
|
<td>76.61</td> |
|
<td>378</td> |
|
</tr> |
|
<tr> |
|
<td>XLM-R<sub>large</sub></td> |
|
<td>75.61</td> |
|
<td>80.50</td> |
|
<td>78.58</td> |
|
<td>366</td> |
|
</tr> |
|
<tr> |
|
<td>Ernie-M<sub>large</sub></td> |
|
<td>78.19</td> |
|
<td>81.69</td> |
|
<td>80.65</td> |
|
<td>403</td> |
|
</tr> |
|
<tr> |
|
<td rowspan="3">BM25</td> |
|
<td>InfoXLM<sub>large</sub></td> |
|
<td>72.09</td> |
|
<td>77.37</td> |
|
<td>75.04</td> |
|
<td>320</td> |
|
</tr> |
|
<tr> |
|
<td>XLM-R<sub>large</sub></td> |
|
<td>73.94</td> |
|
<td>79.37</td> |
|
<td>76.95</td> |
|
<td>333</td> |
|
</tr> |
|
<tr> |
|
<td>Ernie-M<sub>large</sub></td> |
|
<td>76.58</td> |
|
<td>80.76</td> |
|
<td>79.02</td> |
|
<td>381</td> |
|
</tr> |
|
<tr> |
|
<td rowspan="3">SBert</td> |
|
<td>InfoXLM<sub>large</sub></td> |
|
<td>71.20</td> |
|
<td>76.59</td> |
|
<td>74.15</td> |
|
<td>915</td> |
|
</tr> |
|
<tr> |
|
<td>XLM-R<sub>large</sub></td> |
|
<td>72.85</td> |
|
<td>78.78</td> |
|
<td>75.89</td> |
|
<td>835</td> |
|
</tr> |
|
<tr> |
|
<td>Ernie-M<sub>large</sub></td> |
|
<td>75.46</td> |
|
<td>79.89</td> |
|
<td>77.91</td> |
|
<td>920</td> |
|
</tr> |
|
<tr> |
|
<th colspan="1">QA-based approaches</th> |
|
<th colspan="1">VC</th> |
|
<th colspan="4"></th> |
|
</tr> |
|
<tr> |
|
<td rowspan="3">ViMRC<sub>large</sub></td> |
|
<td>InfoXLM<sub>large</sub></td> |
|
<td>54.36</td> |
|
<td>64.14</td> |
|
<td>56.84</td> |
|
<td>9798</td> |
|
</tr> |
|
<tr> |
|
<td>XLM-R<sub>large</sub></td> |
|
<td>53.98</td> |
|
<td>66.70</td> |
|
<td>57.77</td> |
|
<td>9809</td> |
|
</tr> |
|
<tr> |
|
<td>Ernie-M<sub>large</sub></td> |
|
<td>56.62</td> |
|
<td>62.19</td> |
|
<td>58.91</td> |
|
<td>9833</td> |
|
</tr> |
|
<tr> |
|
<td rowspan="3">InfoXLM<sub>large</sub></td> |
|
<td>InfoXLM<sub>large</sub></td> |
|
<td>53.50</td> |
|
<td>63.83</td> |
|
<td>56.17</td> |
|
<td>10057</td> |
|
</tr> |
|
<tr> |
|
<td>XLM-R<sub>large</sub></td> |
|
<td>53.32</td> |
|
<td>66.70</td> |
|
<td>57.25</td> |
|
<td>10066</td> |
|
</tr> |
|
<tr> |
|
<td>Ernie-M<sub>large</sub></td> |
|
<td>56.34</td> |
|
<td>62.36</td> |
|
<td>58.69</td> |
|
<td>10078</td> |
|
</tr> |
|
<tr> |
|
<th colspan="2">LLM</th> |
|
<th colspan="4"></th> |
|
</tr> |
|
<tr> |
|
<td colspan="2">Qwen2.5-1.5B-Instruct</td> |
|
<td>59.23</td> |
|
<td>66.68</td> |
|
<td>65.51</td> |
|
<td>19780</td> |
|
</tr> |
|
<tr> |
|
<td colspan="2">Qwen2.5-3B-Instruct</td> |
|
<td>60.87</td> |
|
<td>66.92</td> |
|
<td>66.10</td> |
|
<td>31284</td> |
|
</tr> |
|
<tr> |
|
<th colspan="1">LLM</th> |
|
<th colspan="1">VC</th> |
|
<th colspan="4"></th> |
|
</tr> |
|
<tr> |
|
<td rowspan="3">Qwen2.5-1.5B-Instruct</td> |
|
<td>InfoXLM<sub>large</sub></td> |
|
<td>64.40</td> |
|
<td>68.37</td> |
|
<td>66.49</td> |
|
<td>19970</td> |
|
</tr> |
|
<tr> |
|
<td>XLM-R<sub>large</sub></td> |
|
<td>64.66</td> |
|
<td>69.63</td> |
|
<td>66.72</td> |
|
<td>19976</td> |
|
</tr> |
|
<tr> |
|
<td>Ernie-M<sub>large</sub></td> |
|
<td>65.70</td> |
|
<td>68.37</td> |
|
<td>67.33</td> |
|
<td>20003</td> |
|
</tr> |
|
<tr> |
|
<td rowspan="3">Qwen2.5-3B-Instruct</td> |
|
<td>InfoXLM<sub>large</sub></td> |
|
<td>65.72</td> |
|
<td>69.66</td> |
|
<td>67.51</td> |
|
<td>31477</td> |
|
</tr> |
|
<tr> |
|
<td>XLM-R<sub>large</sub></td> |
|
<td>66.12</td> |
|
<td>70.44</td> |
|
<td>67.83</td> |
|
<td>31483</td> |
|
</tr> |
|
<tr> |
|
<td>Ernie-M<sub>large</sub></td> |
|
<td>67.48</td> |
|
<td>70.77</td> |
|
<td>68.75</td> |
|
<td>31512</td> |
|
</tr> |
|
<tr> |
|
<th colspan="1">SER Faster (ours)</th> |
|
<th colspan="1">TVC (ours)</th> |
|
<th colspan="4"></th> |
|
</tr> |
|
<tr> |
|
<td>TF-IDF + ViMRC<sub>large</sub></td> |
|
<td>Ernie-M<sub>large</sub></td> |
|
<td style="color:blue">78.32</td> |
|
<td style="color:blue">81.91</td> |
|
<td style="color:blue">80.26</td> |
|
<td style="color:blue">995</td> |
|
</tr> |
|
<tr> |
|
<td>TF-IDF + InfoXLM<sub>large</sub></td> |
|
<td>Ernie-M<sub>large</sub></td> |
|
<td style="color:blue">78.37</td> |
|
<td style="color:blue">81.91</td> |
|
<td style="color:blue">80.32</td> |
|
<td style="color:blue">925</td> |
|
</tr> |
|
<tr> |
|
<th colspan="1">SER (ours)</th> |
|
<th colspan="1">TVC (ours)</th> |
|
<th colspan="4"></th> |
|
</tr> |
|
<tr> |
|
<td rowspan="3">TF-IDF + ViMRC<sub>large</sub></td> |
|
<td>InfoXLM<sub>large</sub></td> |
|
<td>75.13</td> |
|
<td>79.54</td> |
|
<td>76.87</td> |
|
<td>5191</td> |
|
</tr> |
|
<tr> |
|
<td>XLM-R<sub>large</sub></td> |
|
<td>76.71</td> |
|
<td>81.65</td> |
|
<td>78.91</td> |
|
<td>5219</td> |
|
</tr> |
|
<tr> |
|
<td>Ernie-M<sub>large</sub></td> |
|
<td><strong>78.97</strong></td> |
|
<td><strong>82.54</strong></td> |
|
<td><strong>80.91</strong></td> |
|
<td>5225</td> |
|
</tr> |
|
<tr> |
|
<td rowspan="3">TF-IDF + InfoXLM<sub>large</sub></td> |
|
<td>InfoXLM<sub>large</sub></td> |
|
<td>75.13</td> |
|
<td>79.60</td> |
|
<td>76.87</td> |
|
<td>5175</td> |
|
</tr> |
|
<tr> |
|
<td>XLM-R<sub>large</sub></td> |
|
<td>76.74</td> |
|
<td>81.71</td> |
|
<td>78.95</td> |
|
<td>5200</td> |
|
</tr> |
|
<tr> |
|
<td>Ernie-M<sub>large</sub></td> |
|
<td><strong>78.97</strong></td> |
|
<td>82.49</td> |
|
<td><strong>80.91</strong></td> |
|
<td>5297</td> |
|
</tr> |
|
</tbody> |
|
</table> |
|
|
|
**SemViQA-QATC** plays a crucial role in the **SemViQA** system by enhancing accuracy in evidence extraction. When integrated into a pipeline, this model helps determine whether a claim is supported or refuted based on retrieved evidence. |
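As a minimal sketch of that pipeline step, the snippet below passes the extracted evidence to a generic sequence-classification model through the standard `transformers` API. The checkpoint path and label order are placeholders, not published SemViQA identifiers; substitute the verdict-classification model your pipeline actually uses.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint: replace with the verdict-classification model in your pipeline
vc_name = "path/to/your-verdict-classifier"
vc_tokenizer = AutoTokenizer.from_pretrained(vc_name)
vc_model = AutoModelForSequenceClassification.from_pretrained(vc_name)

# `claim` and `evidence` come from the QATC extraction step shown above
inputs = vc_tokenizer(claim, evidence, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = vc_model(**inputs).logits

labels = ["SUPPORTED", "REFUTED", "NEI"]  # assumed label order; check the classifier's config
print(labels[logits.argmax(dim=-1).item()])
```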
|
|
|
## **Citation** |
|
|
|
If you use **SemViQA-QATC** in your research, please cite: |
|
|
|
```bibtex |
|
@misc{tran2025semviqasemanticquestionanswering, |
|
title={SemViQA: A Semantic Question Answering System for Vietnamese Information Fact-Checking}, |
|
author={Dien X. Tran and Nam V. Nguyen and Thanh T. Tran and Anh T. Hoang and Tai V. Duong and Di T. Le and Phuc-Lu Le}, |
|
year={2025}, |
|
eprint={2503.00955}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL}, |
|
url={https://arxiv.org/abs/2503.00955}, |
|
} |
|
``` |
|
|
|
**Paper Link:** [SemViQA on arXiv](https://arxiv.org/abs/2503.00955)

**Source Code:** [GitHub - SemViQA](https://github.com/DAVID-NGUYEN-S16/SemViQA)
|
|
|
## About |
|
|
|
*Built by Dien X. Tran* |
|
[LinkedIn](https://www.linkedin.com/in/xndien2004/)
|
For more details, visit the project repository. |
|
[GitHub](https://github.com/DAVID-NGUYEN-S16/SemViQA)