VLSP2021 - Vietnamese Machine Reading Comprehension

Organized by vlsp-mrc-2021-organisers

First phase

Trial
Sept. 30, 2021, 5 p.m. UTC

End

Competition Ends
Oct. 27, 2021, 4:59 p.m. UTC

Task Description

Machine Reading Comprehension (MRC) has recently emerged as an area of computational linguistics (CL) in which automatic systems are developed to find correct answers to questions posed in human language, given documents containing the answers. The Vietnamese Machine Reading Comprehension task is extraction-based machine reading comprehension on texts from Vietnamese Wikipedia. Based on SQuAD [1, 2], we developed the Vietnamese Question Answering Dataset (UIT-ViQuAD), a reading comprehension dataset consisting of questions posed by crowd-workers on a set of Vietnamese Wikipedia articles. The answer to every question is a span of text from the corresponding reading passage, or the question may be unanswerable.

UIT-ViQuAD 2.0 combines the 23K questions in UIT-ViQuAD 1.0 [3] with over 12K unanswerable questions written adversarially by crowd-workers to look similar to answerable ones. To do well on UIT-ViQuAD 2.0, MRC systems must not only answer questions when possible but also determine when no answer is supported by the context and abstain from answering. In this task, participating teams use UIT-ViQuAD 2.0 to evaluate machine reading comprehension models.

UIT-ViQuAD 1.0, the previous version of the dataset [3], contains 23K+ question-answer pairs on 170+ articles.

IMPORTANT: Before submitting to the system, you must rename your submission file to results.json and compress it as a zip file named results.zip

Evaluation Metrics

Following the evaluation metrics on SQuAD2.0 [2], we use EM and F1-score as evaluation metrics for Vietnamese machine reading comprehension:

  • F1-score: F1-score is a popular metric in natural language processing and is also used in machine reading comprehension. It is computed over the individual tokens of the predicted answer against those of the gold standard answers, based on the number of tokens that match between the two.

                    Precision=(the number of matched tokens)/(the total number of tokens in the predicted answer)

                    Recall=(the number of matched tokens)/(the total number of tokens in the gold standard answer)

                    F1-score=(2*Precision*Recall)/(Precision+Recall)

  • Exact Match (EM): For each question-answer pair, if the characters of the MRC system's predicted answer exactly match the characters of (one of) the gold standard answer(s), EM = 1; otherwise EM = 0. EM is a stringent all-or-nothing metric, with a score of 0 for being off by a single character. When evaluating against an unanswerable (negative) question, if the system predicts any text span as an answer, it automatically receives a score of zero for that question.
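As a rough illustration of the two metrics above, the following Python sketch computes EM and token-level F1 for a single question. It assumes whitespace tokenization, no text normalization, an empty string as the gold answer of an unanswerable question, and taking the maximum over several gold answers. The official evaluation script may differ on any of these points, and the function names are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def exact_match(prediction: str, gold: str) -> int:
    """Return 1 if the two strings are identical character by character, else 0."""
    return int(prediction == gold)


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted answer and one gold answer."""
    pred_tokens = prediction.split()  # assumption: whitespace tokenization
    gold_tokens = gold.split()
    # Assumption: an unanswerable question has an empty gold answer, so the
    # score is 1.0 only when the prediction is empty as well.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each matched token at most as often as it
    # appears in both answers.
    matched = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if matched == 0:
        return 0.0
    precision = matched / len(pred_tokens)
    recall = matched / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def score(prediction: str, golds: Sequence[str]) -> Tuple[int, float]:
    """Best EM and best F1 of a prediction over all gold answers."""
    em = max(exact_match(prediction, g) for g in golds)
    f1 = max(token_f1(prediction, g) for g in golds)
    return em, f1
```

For example, `score("", [""])` gives full credit for abstaining on an unanswerable question, while `score("bảy", [""])` gives zero on both metrics.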

The final ranking is evaluated on the test set according to the F1-score, with EM as a secondary metric: if two teams have the same F1-score, the EM score determines which team ranks higher. The results are rounded to the nearest hundredth (3 decimal places).
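The ranking rule can be sketched as a sort on the rounded F1-score with the rounded EM score as tie-breaker. This is only an illustration: the function name, the team names, and the scores are hypothetical, and the number of decimal places is left as a parameter rather than fixed.

```python
from typing import Dict, List, Tuple


def rank_teams(scores: Dict[str, Tuple[float, float]], ndigits: int) -> List[str]:
    """Order teams best first by rounded F1, breaking ties by rounded EM.

    `scores` maps a team name to its (F1, EM) pair.
    """
    return sorted(
        scores,
        key=lambda team: (
            round(scores[team][0], ndigits),  # primary metric: F1
            round(scores[team][1], ndigits),  # secondary metric: EM
        ),
        reverse=True,
    )


# Hypothetical scores: team_a and team_b tie on F1 after rounding,
# so the higher EM of team_b decides.
example = {
    "team_a": (77.2412, 66.10),
    "team_b": (77.2409, 67.50),
    "team_c": (75.0000, 70.00),
}
ranking = rank_teams(example, ndigits=3)
```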

The task's evaluation script: https://drive.google.com/file/d/1vn6Aed4nacSD932YezQgvWNIOx_1PCb4/view?usp=sharing

Terms:

  • All teams must declare the pre-trained embeddings and pre-trained language models they use in this contest before Oct 10, 2021, and must not use any external resources related to machine reading comprehension and question answering for model training, except the data provided by the organizers. If a team uses pre-trained embeddings or pre-trained language models that are not on the list provided by the participating teams, or uses external resources related to machine reading comprehension and question answering, its final result will not be accepted.
  • A team's name cannot be "BASELINE", "Baseline", or "baseline", because these names cause confusion between the participants' models and our baseline model.
  • In the public test phase, the system allows 10 submissions per day. In the private test phase, only 1 submission per day is allowed.
  • The top 3 teams are required to submit a technical paper to VLSP 2021 to have their achievements acknowledged. If any top team does not submit its paper, the following teams can submit theirs and take its place. The top 3 teams may be required to provide source code so that the final results can be examined.

Submission guidelines:

The submission file is in JSON format and must be named results.json

The JSON content is structured as:

{
      "<id_of_question>": "answer text",
      ...
}

 

Here is an example of the JSON format for the submission file:

{
    "uit_034_35": "Paris là kinh đô ánh sáng",
    "uit_035_57": "",
    "uit_037_12": "Paris là thủ đô Cộng hoà Pháp",
    ...
}

Before submitting to the system, the submission file must be compressed as a zip file named results.zip
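The packaging steps can be sketched in Python as follows. This is a minimal example rather than an official tool: the function name `write_submission` and the output directory argument are assumptions, and an empty string is used for questions the system declines to answer, as in the example above.

```python
import json
import zipfile
from pathlib import Path
from typing import Dict


def write_submission(predictions: Dict[str, str], out_dir: str = ".") -> Path:
    """Write results.json and compress it into results.zip; return the zip path."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    json_path = out / "results.json"
    zip_path = out / "results.zip"

    # ensure_ascii=False keeps Vietnamese characters readable in the file.
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(predictions, f, ensure_ascii=False, indent=4)

    # Store the file at the top level of the archive under its required name.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(json_path, arcname="results.json")
    return zip_path
```

A call such as `write_submission({"uit_035_57": ""})` would produce results.zip in the current directory.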

Dataset Information

We provide UIT-ViQuAD 2.0, consisting of over 35K questions, to participating teams. The dataset is stored in .json format. Here are a few question examples extracted from the dataset.

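Context: Khác với nhiều ngôn ngữ Ấn-Âu khác, tiếng Anh đã gần như loại bỏ hệ thống biến tố dựa trên cách để thay bằng cấu trúc phân tích. Đại từ nhân xưng duy trì hệ thống cách hoàn chỉnh hơn những lớp từ khác. Tiếng Anh có bảy lớp từ chính: động từ, danh từ, tính từ, trạng từ, hạn định từ (tức mạo từ), giới từ, và liên từ. Có thể tách đại từ khỏi danh từ, và thêm vào thán từ.

// English gloss of the Context: Unlike many other Indo-European languages, English has almost entirely abandoned the case-based inflectional system in favor of analytic structure. Personal pronouns retain a more complete case system than other word classes. English has seven major word classes: verbs, nouns, adjectives, adverbs, determiners (i.e. articles), prepositions, and conjunctions. Pronouns can be separated from nouns, and interjections can be added.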

question : Tiếng Anh có bao nhiêu loại từ? // "How many word classes does English have?"

is_impossible : False. // There exists an answer to the question.

answer : bảy. // "seven"

-----------------

question : Ngôn ngữ Ấn-Âu có bao nhiêu loại từ? // "How many word classes do Indo-European languages have?"

is_impossible : True. // No correct answer can be extracted from the Context.

plausible_answer : bảy. // "seven": a plausible but incorrect answer extracted from the Context, of the same type as the answer the question asks for.
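For orientation, here is a small Python sketch of how such records might be traversed. It assumes the file follows the SQuAD 2.0 layout (`data` → `paragraphs` → `qas`, with `answers` and `plausible_answers` lists), which this page does not confirm. The field names, the question ids, and the helper `iter_questions` are assumptions for illustration only and should be checked against the distributed file.

```python
from typing import Any, Dict, Iterator


def iter_questions(dataset: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat record per question from a SQuAD 2.0-style dictionary."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                impossible = qa.get("is_impossible", False)
                # Assumed field names, following SQuAD 2.0.
                key = "plausible_answers" if impossible else "answers"
                yield {
                    "id": qa["id"],
                    "question": qa["question"],
                    "context": paragraph["context"],
                    "is_impossible": impossible,
                    "texts": [a["text"] for a in qa.get(key, [])],
                }


# In-memory sample built from the two examples above; the ids are made up.
sample = {
    "data": [{
        "paragraphs": [{
            "context": "Tiếng Anh có bảy lớp từ chính: động từ, danh từ, ...",
            "qas": [
                {"id": "example_1", "question": "Tiếng Anh có bao nhiêu loại từ?",
                 "is_impossible": False, "answers": [{"text": "bảy"}]},
                {"id": "example_2", "question": "Ngôn ngữ Ấn-Âu có bao nhiêu loại từ?",
                 "is_impossible": True, "plausible_answers": [{"text": "bảy"}]},
            ],
        }],
    }],
}
records = list(iter_questions(sample))
unanswerable = sum(r["is_impossible"] for r in records)
```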

 

Note: All data are transferred to participating teams via email.

[1] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ Questions for Machine Comprehension of Text." Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016.

[2] Pranav Rajpurkar, Robin Jia, and Percy Liang. "Know What You Don’t Know: Unanswerable Questions for SQuAD." Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2018.

[3] Kiet Van Nguyen, Duc-Vu Nguyen, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen. "A Vietnamese Dataset for Evaluating Machine Reading Comprehension." Proceedings of the 28th International Conference on Computational Linguistics. 2020.

Trial

Start: Oct. 1, 2021, midnight

Public Test

Start: Oct. 5, 2021, midnight

Description: Please name your team

Private Test

Start: Oct. 25, 2021, midnight

Competition Ends

Oct. 27, 2021, 11:59 p.m.
