---
license: apache-2.0
---
# MetricX-23

*This is not an officially supported Google product.*

**GitHub repository: [https://github.com/google-research/metricx](https://github.com/google-research/metricx)**

This repository contains the MetricX-23 models,
a family of models for automatic evaluation of translations that were proposed
in the WMT'23 Metrics Shared Task submission
[MetricX-23: The Google Submission to the WMT 2023 Metrics Shared Task](https://aclanthology.org/2023.wmt-1.63/).
The models were trained in [T5X](https://github.com/google-research/t5x) and
then converted for use in PyTorch.

## Available Models
There are 6 models available on HuggingFace that vary in the number of
parameters and whether or not the model is reference-based or reference-free
(also known as quality estimation, or QE):

* [MetricX-23-XXL](https://huggingface.co/google/metricx-23-xxl-v2p0)
* [MetricX-23-XL](https://huggingface.co/google/metricx-23-xl-v2p0)
* [MetricX-23-Large](https://huggingface.co/google/metricx-23-large-v2p0)
* [MetricX-23-QE-XXL](https://huggingface.co/google/metricx-23-qe-xxl-v2p0)
* [MetricX-23-QE-XL](https://huggingface.co/google/metricx-23-qe-xl-v2p0)
* [MetricX-23-QE-Large](https://huggingface.co/google/metricx-23-qe-large-v2p0)

We recommend using the XXL model versions for the best agreement with human
judgments of translation quality, the Large versions for best speed, and the
XL for an intermediate use case.


## Changes to the WMT'23 Submission

The models available here are most similar to the primary submission to the WMT'23 Metrics
Shared Task. They are initialized with [mT5](https://aclanthology.org/2021.naacl-main.41/)
and then fine-tuned on a combination of direct assessment and MQM data. However,
we made some changes that make these models different from the WMT'23 submissions.

First, the models are trained to regress the actual MQM score rather than a
normalized score between 0 and 1. **That means the output from the MetricX-23
models is a score in the range [0, 25] where lower is better (i.e., it predicts
an error score).**

Second, these models were trained with a larger variety of synthetic data that
makes them more robust to translation edge cases like over- and undertranslation,
described in more detail in the following section.

### Synthetic Data

In order for our MetricX models to learn to identify certain types of bad
translations that are not sufficiently (or at all) represented in the regular
training data, we created synthetic examples and mixed them in during training.
The synthetic training data was generated from the DA datasets ranging from
WMT15 to WMT21 (~ 43 language pairs). In most cases, the synthetic examples have
the candidate translation manipulated so as to turn it into a bad translation
with a specific issue commonly unrecognized by learned metrics.

The table below provides an overview of the various failure modes that we
considered, including brief descriptions of how we prepared the synthetic data
to address them.

| Failure mode | Synthetic example description |
| ----------- | ----------- |
| Undertranslation | Candidate translation with an arbitrary sentence removed (if multi-sentence); alternatively, candidate with a certain proportion of words removed from the end. |
| Overtranslation | Candidate translation duplicated (with space in between). |
| Fluent but unrelated translation | Arbitrary reference of a similar length from the dataset. |
| Gibberish | Text of a similar length as the reference, generated by sampling words from the reference translation vocabulary (built from all references in the data). |
| Missing punctuation | Reference translation with the end punctuation removed (11 punctuation symbols considered). |
| Latin instead of Chinese/Japanese or Hindi/Bengali punctuation | Candidate translation with the language-specific punctuation symbol at the end replaced with the Latin equivalent (e.g., "." instead of "。" or "।"); alternatively, the punctuation symbol is replaced with the Latin equivalent in the reference, keeping the correct one in the candidate. |
| Reference-matching translation | Reference translation copied as the candidate translation (unlike the rest of the synthetic data, these examples are meant to train the metric to predict a perfect score for candidates matching the reference). |

Examples from the first 4 categories were assigned a label corresponding to the
worst score on the given rating scale (e.g., 25 when mixed with MQM training
data), whereas the reference-matching translation examples were assigned the best
score (e.g., 0 when used with MQM data). The missing/incorrect punctuation
examples were labeled with a score slightly worse than perfect.
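
As a rough illustration of the table above, a few of the candidate manipulations could be sketched as follows. This is not the code used to build the MetricX training data: the function names, the `drop_fraction` default, and the punctuation subset are assumptions made for the example (the source mentions 11 punctuation symbols without listing them). The two labels are the MQM-scale values stated above.

```python
# Hypothetical helpers; names and defaults are illustrative only.
WORST_MQM_SCORE = 25.0  # label for the first four failure modes (MQM scale)
BEST_MQM_SCORE = 0.0    # label for reference-matching examples (MQM scale)

# Assumed subset of end punctuation, not the actual list of 11 symbols.
END_PUNCTUATION = ".!?。।"


def overtranslate(candidate: str) -> str:
    """Duplicate the candidate translation, with a space in between."""
    return f"{candidate} {candidate}"


def undertranslate(candidate: str, drop_fraction: float = 0.5) -> str:
    """Remove a proportion of words from the end of the candidate."""
    words = candidate.split()
    n_drop = round(len(words) * drop_fraction)
    keep = max(1, len(words) - n_drop)  # always keep at least one word
    return " ".join(words[:keep])


def strip_end_punctuation(reference: str) -> str:
    """Remove one trailing punctuation symbol, if there is one."""
    if reference and reference[-1] in END_PUNCTUATION:
        return reference[:-1]
    return reference


example = {
    "reference": "Guten Tag.",
    "hypothesis": overtranslate("Guten Tag."),
    "label": WORST_MQM_SCORE,
}
```

The score for the punctuation examples is left out on purpose, since the text only says it is slightly worse than perfect.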

Note that some of the synthetic datasets are only meaningful in the
reference-based scenario, and we thus excluded them when training a QE variant
of MetricX. These are the Latin-vs-special punctuation and the
reference-matching translation examples.

Most of the synthetic training sets were created using stratified sampling
across target languages, taking 500 examples per target language. One exception
is the missing punctuation set, which used a stratified sample across different
punctuation symbols instead.
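
A minimal sketch of stratified sampling across target languages is shown below. It assumes each example is a dict with a `"target_language"` key, which is a hypothetical field name, and it caps the sample at the group size when a language has fewer examples than requested. The actual sampling code is not part of this document.

```python
import random
from collections import defaultdict


def stratified_sample(examples, per_language=500, seed=0):
    """Draw up to `per_language` examples for every target language."""
    rng = random.Random(seed)

    # Group the examples by their (hypothetical) target-language field.
    by_language = defaultdict(list)
    for example in examples:
        by_language[example["target_language"]].append(example)

    # Sample each group independently so that every language is represented.
    sample = []
    for language in sorted(by_language):
        group = by_language[language]
        sample.extend(rng.sample(group, min(per_language, len(group))))
    return sample
```

The same function could stratify by punctuation symbol instead, as was done for the missing punctuation set, by grouping on a different key.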

When training MetricX, a small proportion of the synthetic examples was mixed
with the regular training examples. During the first-stage fine-tuning on DA
data, each synthetic training set constituted between 0.1% and 1% of all
training examples, whereas in the second-stage fine-tuning on MQM data we used
an even smaller proportion, around 0.05%.

As for evaluating the effect of the synthetic training data on the model's
performance, the DEMETR challenge set - which we originally used to evaluate the
models submitted to the WMT23 Metrics Shared Task - was no longer adequate. We
therefore created a new DEMETR-style test set based on the WMT22 DA data, with
examples constructed analogously to the synthetic training examples described
above. This test set helped us determine the right proportions of
synthetic data for fine-tuning in order to make MetricX robust for the failure
modes in consideration, without sacrificing the system- and segment-level
correlations with human ratings.

## Usage

The code for using MetricX models can be found at [https://github.com/google-research/metricx](https://github.com/google-research/metricx).
The repository contains example prediction scripts, described below.

The `metricx23/predict.py` script contains an example of how to run inference
on the models.

### Reference-Based
Example usage for a reference-based model:

```bash
python -m metricx23.predict \
  --tokenizer google/mt5-xl \
  --model_name_or_path google/metricx-23-xl-v2p0 \
  --max_input_length 1024 \
  --batch_size 1 \
  --input_file input.jsonl \
  --output_file output.jsonl
```

`input.jsonl` is expected to have 1 serialized JSON object per line with
`"reference"` and `"hypothesis"` fields. The output jsonl will be parallel
to `input.jsonl` but additionally contain a `"prediction"` field with the predicted score.

Note that the model was trained with a maximum input length of 1024 tokens, so
significantly increasing that value may lead to unpredictable behavior.
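
One possible way to produce a file in this format is sketched below. The field names `"reference"` and `"hypothesis"` come from the description above; the helper names and the example sentences are assumptions for illustration only.

```python
import json

# Illustrative data; replace with real reference/hypothesis pairs.
examples = [
    {"reference": "The cat sat on the mat.", "hypothesis": "The cat sits on the mat."},
    {"reference": "Guten Morgen.", "hypothesis": "Guten Tag."},
]


def to_jsonl(records):
    """Serialize each record as one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def write_jsonl(path, records):
    """Write the records to `path` in JSON Lines format (UTF-8)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(to_jsonl(records))
```

Calling `write_jsonl("input.jsonl", examples)` would then create the file passed to `--input_file`.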

### Reference-Free
Example usage for a reference-free model:

```bash
python -m metricx23.predict \
  --tokenizer google/mt5-xl \
  --model_name_or_path google/metricx-23-qe-xl-v2p0 \
  --max_input_length 1024 \
  --batch_size 1 \
  --input_file input.jsonl \
  --output_file output.jsonl \
  --qe
```

`input.jsonl` is expected to have 1 serialized JSON object per line with
`"source"` and `"hypothesis"` fields. The output jsonl will be parallel
to `input.jsonl` but additionally contain a `"prediction"` field with the predicted score.
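
Because the output is parallel to the input, post-processing can be as simple as parsing each line and reading its `"prediction"` field. The sketch below assumes nothing beyond the fields described above; the helper names and sample values are hypothetical. It sorts hypotheses best-first, keeping in mind that a lower score means a better translation.

```python
import json


def read_records(lines):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


def rank_best_first(records):
    """Sort by predicted error score; the lowest score comes first."""
    return sorted(records, key=lambda r: r["prediction"])


# Made-up output lines standing in for the contents of output.jsonl.
sample_lines = [
    '{"source": "Hallo Welt", "hypothesis": "Hello world", "prediction": 0.8}',
    '{"source": "Hallo Welt", "hypothesis": "Hi planet", "prediction": 6.2}',
    '{"source": "Hallo Welt", "hypothesis": "Hello, world", "prediction": 0.3}',
]
ranked = rank_best_first(read_records(sample_lines))
```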


## Meta-Evaluation
The `metricx23/evaluate.py` script contains code to calculate various correlations
between the MetricX-23 scores and MQM ratings of translation quality using the
[MT Metrics Eval](https://github.com/google-research/mt-metrics-eval) library.

Example usage:

```bash
python -m metricx23.evaluate \
  --dataset wmt22 \
  --lp en-de \
  --input_file input.jsonl \
  --output_file output.json
```

`input.jsonl` is expected to have one JSON object serialized per line.
Each JSON object is expected to contain 4 fields:

* `"system_id"`: The name of the system that generated the translation.
* `"segment_id"`: The 0-based index of the corresponding segment in the MT
Metrics Eval data.
* `"label"`: The ground-truth translation quality score (higher is better).
* `"prediction"`: The model-predicted translation quality score (lower is
better; the script negates the scores so higher is better).
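
To see why the negation matters, consider the toy example below. It is not the code in `metricx23/evaluate.py`, which relies on MT Metrics Eval; the `pearson` helper and all record values are made up for illustration. Labels are higher-is-better while raw predictions are lower-is-better, so the raw correlation comes out negative until the predictions are negated.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical records in the four-field input format.
records = [
    {"system_id": "sys-a", "segment_id": 0, "label": 0.0, "prediction": 0.5},
    {"system_id": "sys-a", "segment_id": 1, "label": -1.0, "prediction": 1.5},
    {"system_id": "sys-a", "segment_id": 2, "label": -5.0, "prediction": 6.0},
]

labels = [r["label"] for r in records]
raw = [r["prediction"] for r in records]
negated = [-p for p in raw]  # flip the sign so that higher is better

print(pearson(labels, raw) < 0)      # raw error scores anti-correlate
print(pearson(labels, negated) > 0)  # negated scores correlate positively
```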

The script will calculate the 4 agreement and correlation statistics that were used in the
WMT'23 Shared Task. Below are the results for the MetricX-23 models on the
WMT'22 Metrics Shared Task data:

English-German:

| Model      | System-Level Accuracy | System-Level Pearson | Segment-Level Pearson | Segment-Level Pairwise Acc |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| MetricX-23-XXL      | 0.795       | 0.835       | 0.546       | 0.619       |
| MetricX-23-XL   | 0.756        | 0.813       | 0.540       | 0.605       |
| MetricX-23-Large   | 0.769        | 0.759       | 0.507       | 0.595       |
| MetricX-23-QE-XXL   | 0.769        | 0.830       | 0.490       | 0.606       |
| MetricX-23-QE-XL   | 0.718        | 0.684       | 0.421       | 0.594       |
| MetricX-23-QE-Large   | 0.744        | 0.671       | 0.387       | 0.579       |

English-Russian:

| Model      | System-Level Accuracy | System-Level Pearson | Segment-Level Pearson | Segment-Level Pairwise Acc |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| MetricX-23-XXL      | 0.905       | 0.943       | 0.477       | 0.609       |
| MetricX-23-XL   | 0.876        | 0.906       | 0.498       | 0.589       |
| MetricX-23-Large   | 0.876        | 0.841       | 0.474       | 0.569       |
| MetricX-23-QE-XXL   | 0.895        | 0.940       | 0.470       | 0.602       |
| MetricX-23-QE-XL   | 0.848        | 0.861       | 0.415       | 0.570       |
| MetricX-23-QE-Large   | 0.819        | 0.778       | 0.411       | 0.551       |

Chinese-English:

| Model      | System-Level Accuracy | System-Level Pearson | Segment-Level Pearson | Segment-Level Pairwise Acc |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| MetricX-23-XXL      | 0.868       | 0.919       | 0.605       | 0.551       |
| MetricX-23-XL   | 0.868        | 0.924       | 0.584       | 0.543       |
| MetricX-23-Large   | 0.857        | 0.919       | 0.555       | 0.539       |
| MetricX-23-QE-XXL   | 0.857        | 0.928       | 0.573       | 0.544       |
| MetricX-23-QE-XL   | 0.802        | 0.879       | 0.546       | 0.529       |
| MetricX-23-QE-Large   | 0.758        | 0.904       | 0.522       | 0.529       |


The `metricx23/evaluate_wmt23.py` script re-calculates the average correlation
score that was used to rank submissions from the
[WMT'23 Shared Task](https://www2.statmt.org/wmt23/pdf/2023.wmt-1.51.pdf).

Example usage:

```bash
python -m metricx23.evaluate_wmt23 \
  --en_de predictions_ende.jsonl \
  --he_en predictions_heen.jsonl \
  --zh_en predictions_zhen.jsonl \
  --output_file output.json
```

Each of the 3 input files is expected to be in the same format as described
above. Each file should hold the inference output for one of the language
pairs from the WMT'23 dataset.

The results for each model are as follows:

| Model      | Average Correlation |
| ----------- | ----------- |
| MetricX-23-XXL      | 0.812       |
| MetricX-23-XL   | 0.813        |
| MetricX-23-Large   | 0.794        |
| MetricX-23-QE-XXL   | 0.797        |
| MetricX-23-QE-XL   | 0.767        |
| MetricX-23-QE-Large   | 0.762        |


## Citation
If you use MetricX-23 in your research, please cite the following publication:

```bibtex
@inproceedings{juraska-etal-2023-metricx,
    title = {{MetricX-23: The Google Submission to the WMT 2023 Metrics Shared Task}},
    author = "Juraska, Juraj  and
      Finkelstein, Mara  and
      Deutsch, Daniel  and
      Siddhant, Aditya  and
      Mirzazadeh, Mehdi  and
      Freitag, Markus",
    editor = "Koehn, Philipp  and
      Haddow, Barry  and
      Kocmi, Tom  and
      Monz, Christof",
    booktitle = "Proceedings of the Eighth Conference on Machine Translation",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.wmt-1.63",
    doi = "10.18653/v1/2023.wmt-1.63",
    pages = "756--767",
}
```