NLP Question-Answering SQuAD-v2

Overview

SQuADv2 is a question-answering dataset that pairs contexts with questions.
A single context can have several questions, and some questions cannot be answered from the context.

Data

The original SQuADv2 dataset consists of 35 titles in total.

# Title
['Normans' 'Computational_complexity_theory' 'Southern_California'
 'Sky_(United_Kingdom)' 'Victoria_(Australia)' 'Huguenot' 'Steam_engine'
 'Oxygen' '1973_oil_crisis' 'European_Union_law' 'Amazon_rainforest'
 'Ctenophora' 'Fresno,_California' 'Packet_switching' 'Black_Death'
 'Geology' 'Pharmacy' 'Civil_disobedience' 'Construction' 'Private_school'
 'Harvard_University' 'Jacksonville,_Florida' 'Economic_inequality'
 'University_of_Chicago' 'Yuan_dynasty' 'Immune_system'
 'Intergovernmental_Panel_on_Climate_Change' 'Prime_number' 'Rhine'
 'Scottish_Parliament' 'Islamism' 'Imperialism' 'Warsaw'
 'French_and_Indian_War' 'Force']

Only 4 of these titles are used for evaluation in this trial competition: ['Normans' 'Computational_complexity_theory' 'Southern_California' 'Sky_(United_Kingdom)'].

The evaluation data for these 4 titles consists of 1,160 Context-Question-Answer sets in total, and a single Context can have several Questions.

Among the 1,160 questions, some cannot be answered exactly from the Context.

Data example

Context
Geographical theories such as environmental determinism also suggested that tropical environments created uncivilized people in need of European guidance. For instance, American geographer Ellen Churchill Semple argued that even though human beings originated in the tropics they were only able to become fully human in the temperate zone. Tropicality can be paralleled with Edward Said’s Orientalism as the west’s construction of the east as the “other”. According to Siad, orientalism allowed Europe to establish itself as the superior and the norm, which justified its dominance over the essentialized Orient.

Question	Answer
Which theory suggested people in the tropics were uncivilized?	['environmental determinism']
According to Ellen Churchill Semple what type of climate was necessary for humans to become fully human?	['temperate', 'temperate zone', 'the temperate zone']
According to certain Geographical theories what type of human does a tropical climate produce?	['uncivilized', 'fully human', 'uncivilized people']
By justification certain racial and geographical theories, Europe thought of itself as what?	['superior', 'the superior and the norm']
Which theory suggested people in the tropics were civilized?	[""]
According to Ellen Churchill Semple what type of climate was unnecessary for humans to become fully human?	[""]
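
The example above can be held in memory as plain Python records. The sketch below is illustrative only: the field names `context`, `question`, and `answers` and the helper `is_unanswerable` are assumptions, not a format defined by this competition. It treats an answer list of [""] as marking a question that has no answer in the context, as in the last two rows of the table.

```python
# Hypothetical in-memory layout for the example above; all field names are assumptions.
context = (
    "Geographical theories such as environmental determinism also suggested "
    "that tropical environments created uncivilized people in need of "
    "European guidance."
)  # first sentence of the example context only, for brevity

samples = [
    {
        "context": context,
        "question": "Which theory suggested people in the tropics were uncivilized?",
        "answers": ["environmental determinism"],
    },
    {
        "context": context,
        "question": "Which theory suggested people in the tropics were civilized?",
        "answers": [""],  # no answer can be found in the context
    },
]


def is_unanswerable(sample):
    """Return True when every candidate answer is the empty string."""
    return all(answer == "" for answer in sample["answers"])


unanswerable = [s["question"] for s in samples if is_unanswerable(s)]
print(len(samples), len(unanswerable))
```

Several records share the same context string, which mirrors the one-context, many-questions structure described above.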

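Evaluation method

The evaluation metric is the macro F1 score.
The evaluation data consists of 1,160 samples, and each sample has up to 6 candidate answers.
Each prediction is preprocessed, its F1 score is computed against every candidate answer, and the maximum is taken as the final F1 score for that sample.
Refer to the following code.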

def normalize_text(s):
    """Removing articles and punctuation, and standardizing whitespace are all typical text processing steps."""
    import string, re

    def remove_articles(text):
        regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
        return re.sub(regex, " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))
    
def compute_f1(truth, prediction):
    pred_tokens = normalize_text(prediction).split()
    truth_tokens = normalize_text(truth).split()
    
    # if either the prediction or the truth is no-answer then f1 = 1 if they agree, 0 otherwise
    if len(pred_tokens) == 0 or len(truth_tokens) == 0:
        return int(pred_tokens == truth_tokens)
    
    common_tokens = set(pred_tokens) & set(truth_tokens)
    
    # if there are no common tokens then f1 = 0
    if len(common_tokens) == 0:
        return 0
    
    prec = len(common_tokens) / len(pred_tokens)
    rec = len(common_tokens) / len(truth_tokens)
    
    return 2 * (prec * rec) / (prec + rec)


answer_list = ['the British Empire', 'Terra nullius', 'British Empire', 'British']

predict = 'the British Empire'
score = max([compute_f1(answer, predict) for answer in answer_list]) # score=1.0

predict = 'ttthe British Empire'
score = max([compute_f1(answer, predict) for answer in answer_list]) # score=0.8

predict = 'British Empire'
score = max([compute_f1(answer, predict) for answer in answer_list]) # score=1.0

predict = 'British'
score = max([compute_f1(answer, predict) for answer in answer_list]) # score=1.0
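
The per-sample scores above still have to be combined into one number. The sketch below assumes that the final score is the plain average of the per-sample maximum F1 over all samples; that averaging rule, and the helper names `sample_f1` and `mean_f1`, are assumptions and are not confirmed by the scoring description. The normalization and F1 functions are compact restatements of the ones above so that the block runs on its own.

```python
import re
import string

PUNCTUATION = set(string.punctuation)


def normalize_text(s):
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    s = "".join(ch for ch in s.lower() if ch not in PUNCTUATION)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def compute_f1(truth, prediction):
    """Token-level F1 between one candidate answer and the prediction."""
    pred_tokens = normalize_text(prediction).split()
    truth_tokens = normalize_text(truth).split()
    # If either side is empty (no answer), score 1 only when both are empty.
    if not pred_tokens or not truth_tokens:
        return float(pred_tokens == truth_tokens)
    common = set(pred_tokens) & set(truth_tokens)
    if not common:
        return 0.0
    prec = len(common) / len(pred_tokens)
    rec = len(common) / len(truth_tokens)
    return 2 * prec * rec / (prec + rec)


def sample_f1(answers, prediction):
    """Best F1 of the prediction against all candidate answers of one sample."""
    return max(compute_f1(answer, prediction) for answer in answers)


def mean_f1(all_answers, predictions):
    """Assumed aggregation: average of the per-sample best F1."""
    scores = [sample_f1(a, p) for a, p in zip(all_answers, predictions)]
    return sum(scores) / len(scores)


all_answers = [
    ["the British Empire", "Terra nullius", "British Empire", "British"],
    [""],  # unanswerable question
    [""],  # unanswerable question
]
predictions = ["British Empire", "", "Terra nullius"]
print(mean_f1(all_answers, predictions))
```

In this toy run the first two predictions score 1 and the third scores 0, because a non-empty prediction for an unanswerable question shares no tokens with the empty answer.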

Submission rules

Input data

The input data is the evaluation data (X_test) passed to the prediction function. It is a list of records, structured as follows.

input_list = [{"context":c1, "question":q1}, {"context":c2, "question":q2}, ...]
# len(input_list) = 1160

Output data

The output data is the data returned by the prediction function, that is, the predicted results. It must consist of answers to all 1,160 questions. If a question cannot be answered exactly, use the empty string "" instead.

result = ["PredAnswer1", "", ..., "PredAnswer1160"] # len(result) = 1160
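
Before submitting, it may help to check that the output matches the format described above. The sketch below is a hypothetical helper, `validate_result`, which is not part of the AIF tooling. It only checks that there is one string per input record. The two input records are toy data built from the example context shown earlier.

```python
def validate_result(input_list, result):
    """Check that result holds exactly one string answer per input record."""
    if len(result) != len(input_list):
        raise ValueError(
            f"expected {len(input_list)} answers, got {len(result)}"
        )
    for index, answer in enumerate(result):
        if not isinstance(answer, str):
            raise TypeError(f"answer {index} is not a string: {answer!r}")
    return True


# Toy input in the record format described above.
input_list = [
    {
        "context": "Geographical theories such as environmental determinism ...",
        "question": "Which theory suggested people in the tropics were uncivilized?",
    },
    {
        "context": "Geographical theories such as environmental determinism ...",
        "question": "Which theory suggested people in the tropics were civilized?",
    },
]

# One answer per question; "" marks a question with no answer in the context.
result = ["environmental determinism", ""]
print(validate_result(input_list, result))  # True
```

Returning None or skipping an unanswerable question would fail this check. The empty string keeps the list aligned with the questions.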

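Submission code

AIF inference automation runs prediction and scoring on the evaluation data (X_test) using the prediction function defined by the participant.
A submission can consist of a function that builds the model trained during the competition and a function that predicts on data with the built model.
See the following example code.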

import os
import aifactory.grade as aif
import ipynbname
import tensorflow as tf

# Build the trained model.
# MY_WEIGHTS_PATH must be defined by the participant (path to the saved model).
def build_model():
    model = tf.keras.models.load_model(MY_WEIGHTS_PATH)

    return model

# Define the prediction function.
# input_message_list - X_test; see the input data format
# model - the built model, or None if no model is needed
def predict(input_message_list, model):
    result = []
    for input_message in input_message_list:
        # Each prediction should be an answer string, or "" if no answer is found.
        pred = model.predict(input_message)
        result.append(pred)

    return result

# submit returns the functions defined above, in order, as a list; it is used to submit to AIF.
def submit():
    return [build_model, predict]

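# AIF_TASK_KEY - the key provided to participants of this task (shown on the "내정보" (My Info) page).
# MODEL_NAME and AIF_TASK_KEY must be defined by the participant.
if __name__ == "__main__":
    filename = ''
    try:
        # Use the notebook name when running inside a notebook.
        filename = ipynbname.name()
    except Exception:
        # Otherwise fall back to the script file name.
        filename = os.path.basename(__file__)

    aif.submit(MODEL_NAME, AIF_TASK_KEY, filename, submit)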
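
The functions can be exercised locally before they are submitted. The sketch below is a dry run that does not use `aifactory` at all. It assumes the functions returned by `submit` are called in list order, the build function first and the prediction function second. `DummyModel` and `dry_run` are hypothetical stand-ins, and the real AIF runner may behave differently.

```python
class DummyModel:
    """Stand-in model that answers every question with the empty string."""

    def predict(self, input_message):
        return ""


def build_model():
    # A real submission would load the trained model here.
    return DummyModel()


def predict(input_message_list, model):
    # One answer string per input record, in the same order.
    return [model.predict(message) for message in input_message_list]


def submit():
    return [build_model, predict]


def dry_run(submit_fn, input_list):
    """Assumed call order: build the model, then predict on the inputs."""
    build_fn, predict_fn = submit_fn()
    model = build_fn()
    result = predict_fn(input_list, model)
    if len(result) != len(input_list):
        raise ValueError("predict must return one answer per input record")
    return result


input_list = [
    {"context": "c1", "question": "q1"},
    {"context": "c2", "question": "q2"},
]
print(dry_run(submit, input_list))  # ['', '']
```

Swapping `DummyModel` for the trained model leaves the rest of the dry run unchanged, so the same check can be reused on the final submission functions.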