A year ago we released a corpus of a thousand Machine Translation (MT) errors, manually annotated at the phrase level over 400 French-English source sentences and their machine translations, along with their human post-edited versions and original references.
The creation of this dataset was motivated by our research on Quality Estimation (QE), and in particular by our work on prediction at the newly introduced phrase level. Phrase-level prediction can be seen as a middle ground between word-level and sentence-level prediction, two well-studied levels. However, QE at the phrase level requires delimiting sub-segments within each segment, and we lacked reference annotations against which to evaluate our segmentation strategies. We therefore created these gold-standard annotations.
We built this dataset with the help of human annotators (all fluent English speakers) who were asked to identify any ungrammaticalities or variations of meaning that led to incorrect translations. To do so, they compared raw machine translations against their post-edited versions, references and source sentences extracted from the LIG corpus (Potet et al., 2012). The annotations were collected using the Brat Rapid Annotation Tool (a.k.a. BRAT) along with a set of guidelines, and stored in stand-off format. More details about the data collection and the annotation environment can be found in the paper we published at LREC'16.
In addition to the data collection, the paper describes the segmentation and labelling strategies we investigated, and compares the automatic labels produced by these strategies against the gold-standard annotations we collected.
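To give a rough idea of what comparing automatic labels against gold-standard spans can look like, here is a minimal sketch. It is an illustration only and not the evaluation protocol from the paper: it assumes that both the gold annotations and the predictions have already been mapped to token-index spans, and it scores them with token-level precision, recall and F1. The function names `flag_tokens` and `precision_recall_f1` and the example spans are hypothetical.

```python
def flag_tokens(n_tokens, spans):
    """Mark every token index covered by a (start, end) span; end is exclusive."""
    flags = [False] * n_tokens
    for start, end in spans:
        for i in range(max(start, 0), min(end, n_tokens)):
            flags[i] = True
    return flags


def precision_recall_f1(gold, pred):
    """Token-level scores for the 'erroneous' class, given two boolean lists."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if p and not g)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical 7-token translation: one gold error span, one predicted span.
gold = flag_tokens(7, [(1, 3)])   # tokens 1 and 2 are annotated as erroneous
pred = flag_tokens(7, [(2, 5)])   # tokens 2, 3 and 4 are predicted as erroneous
scores = precision_recall_f1(gold, pred)
print("precision=%.2f recall=%.2f f1=%.2f" % scores)
```

Scoring at the token level sidesteps the question of how to match spans whose boundaries only partly agree, at the cost of ignoring where one phrase ends and the next begins.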
Finally, to support further research, the dataset is freely accessible under a CC-BY-SA license. We also provide some of our scripts to facilitate reuse of our stand-off annotations with the original content of the LIG corpus (which has to be downloaded separately).
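Because the annotations are distributed in stand-off form, reusing them amounts to reading character offsets and applying them to the separately downloaded text. The sketch below is not one of the scripts we provide. It assumes standard brat text-bound lines (an ID starting with `T`, a tab, the type followed by offsets, a tab, the covered text), with `;` separating the fragments of a discontinuous span. The class and function names, the labels `Meaning` and `Grammar`, and the sample sentence are hypothetical.

```python
from typing import NamedTuple


class TextBound(NamedTuple):
    ann_id: str    # e.g. "T1"
    label: str     # annotation type
    spans: list    # list of (start, end) character offsets, end exclusive
    text: str      # covered text as stored in the .ann file


def parse_text_bound(ann_content):
    """Parse the text-bound ("T") lines of a brat stand-off file."""
    annotations = []
    for line in ann_content.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, attributes, notes, blank lines
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # malformed line: ignored in this sketch
        ann_id, type_and_offsets, text = parts[0], parts[1], parts[2]
        label, offsets = type_and_offsets.split(" ", 1)
        spans = []
        for fragment in offsets.split(";"):
            start, end = fragment.split()
            spans.append((int(start), int(end)))
        annotations.append(TextBound(ann_id, label, spans, text))
    return annotations


def covered_text(document, annotation):
    """Rebuild the annotated text from the document using the offsets."""
    return " ".join(document[s:e] for s, e in annotation.spans)


# Hypothetical document and annotation file contents.
document = "the cat has sat on a mat"
ann_content = "T1\tMeaning 4 7\tcat\nT2\tGrammar 12 15;21 24\tsat mat\n"

annotations = parse_text_bound(ann_content)
for ann in annotations:
    # A mismatch here would suggest the offsets refer to a different text.
    print(ann.ann_id, ann.label, covered_text(document, ann) == ann.text)
```

Checking the text rebuilt from the offsets against the text stored in the annotation file is a cheap way to confirm that the downloaded corpus matches the version the annotations were made on.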