Natural Language Processing with PaddleNLP: Information Extraction Techniques and Applications
Key points: the SPO diagram and BCEWithLogitsLoss
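Information extraction aims to pull structured knowledge, such as entities, relations, and events, out of unstructured natural-language text. Given a natural-language sentence and a predefined set of schemas, the task is to extract every SPO (subject, predicate, object) triple that satisfies the schema constraints.
For example, the schema for the 妻子 (wife) relation, where 人物 means "person", is defined as: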
{ S_TYPE: 人物, P: 妻子, O_TYPE: { @value: 人物 } }
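This example shows how to use PaddleNLP to quickly build an entity-relation extraction system and submit it to the leaderboard of the Qianyan (千言) information extraction competition, relation extraction track.
Because the DuIE2.0 task requires extracting multiple, overlapping SPO triples, the competition extends the standard 'BIO' tagging scheme.
Each token is tagged according to its position within an entity span (B, I, or O), and the B tag is further split by the kind of predicate the token helps to build. Given the schema set, for N different predicates and the two cases of head entity and tail entity, there are 2N B tags in total. Adding the I and O tags gives (2N+2) labels per token, as shown in the figure below.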
Tokens are labeled by category.
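The (2N+2) label layout can be sketched in a few lines of Python. This is only an illustration: the function names, the tag naming (`B-<predicate>@S` / `B-<predicate>@O`), and the ordering of the labels are assumptions made here, not necessarily what the competition baseline uses.

```python
def build_label_names(predicates):
    """Return the (2N + 2) label names for N predicates.

    Ordering (subject B tags, object B tags, then I and O) is an
    illustrative assumption.
    """
    labels = [f"B-{p}@S" for p in predicates]   # B tags for head entities
    labels += [f"B-{p}@O" for p in predicates]  # B tags for tail entities
    labels += ["I", "O"]                        # shared I and O tags
    return labels


def multi_hot(active, labels):
    """Encode the set of active tags of one token as a 0/1 vector."""
    return [1.0 if name in active else 0.0 for name in labels]


predicates = ["作者", "妻子", "导演"]  # N = 3, hypothetical subset of the schema
labels = build_label_names(predicates)
print(len(labels))  # 2 * 3 + 2 = 8

# A token may start the subject of one triple and the object of another,
# which is why this is a multi-label problem rather than a single tag per token.
vector = multi_hot({"B-作者@S", "B-妻子@O"}, labels)
print(vector)
```

Because several entries of the vector can be 1 at once, each label is scored independently, which is what makes a sigmoid-based loss a natural fit.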
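The SPO triples output by a participating system on the test set are matched exactly against the manually annotated SPO triples, and the F1 score is used as the evaluation metric. Note that for SPO triples with a complex O-value type, every slot must match exactly for the triple to count as correct. Because some texts refer to entities by aliases, the alias dictionary of the Baidu Knowledge Graph is used to assist the evaluation. F1 is computed as follows: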
F1 = (2 * P * R) / (P + R), where P is precision and R is recall.
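As a minimal sketch of the metric, the snippet below computes P, R, and F1 by exact matching over sets of (subject, predicate, object) triples. It deliberately ignores the alias dictionary and the multi-slot O values that the official script handles; the function name and the second gold triple are made up for illustration.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match P/R/F1 over sets of (subject, predicate, object) triples."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1


# Hypothetical gold and predicted triples for one sentence.
gold = {("邪少兵王", "作者", "冰火未央"), ("邪少兵王", "连载网站", "旗峰天下")}
pred = {("邪少兵王", "作者", "冰火未央"), ("邪少兵王", "作者", "旗峰天下")}

p, r, f1 = precision_recall_f1(pred, gold)
print(p, r, f1)  # 0.5 0.5 0.5
```

One of two predictions is correct (P = 0.5) and one of two gold triples is recovered (R = 0.5), so F1 is 0.5.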
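The task can be treated as sequence labeling, so the baseline uses an ERNIE sequence labeling model.
PaddleNLP provides commonly used sequence labeling models built on the ERNIE pretrained model, and they can be loaded in one step by specifying the model name. To make data handling easier, PaddleNLP also ships a Tokenizer for each pretrained model, which tokenizes text, converts tokens to IDs, truncates to a maximum length, and so on.
To process text, simply call the tokenizer; it returns the inputs the model expects.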
├── dev_data.json
├── dev.json
├── duie.json
├── duie.json.zip
├── duie_sample
│   └── License.docx
├── id2spo.json
├── predicate2id.json   # how many label types there are
├── schema.xlsx
├── test_data.json
├── test.json
├── train_data.json     # training data
└── train.json
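{
    "text": "《邪少兵王》是冰火未央写的网络小说连载于旗峰天下",  # the passage to extract from
    "spo_list": [  # gold labels (the triples to extract)
        {
            "predicate": "作者",  # relation: 作者 (author)
            "object_type": {
                "@value": "人物"  # type of the tail entity: 人物 (person)
            },
            "subject_type": "图书作品",  # type of the head entity: 图书作品 (book)
            "object": {
                "@value": "冰火未央"  # value of the tail entity
            },
            "subject": "邪少兵王"  # value of the head entity
        }
    ]
}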
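To make the record format concrete, here is a small sketch that parses one such record and flattens its `spo_list` into (subject, predicate, object) tuples. It assumes each record is stored as one JSON object per line and only reads the `@value` slot of the object; the helper name `extract_triples` is hypothetical.

```python
import json

# One record in the format shown above (comments removed so it is valid JSON).
line = json.dumps({
    "text": "《邪少兵王》是冰火未央写的网络小说连载于旗峰天下",
    "spo_list": [{
        "predicate": "作者",
        "object_type": {"@value": "人物"},
        "subject_type": "图书作品",
        "object": {"@value": "冰火未央"},
        "subject": "邪少兵王",
    }],
}, ensure_ascii=False)


def extract_triples(example):
    """Flatten spo_list into (subject, predicate, object @value) tuples."""
    return [
        (spo["subject"], spo["predicate"], spo["object"]["@value"])
        for spo in example["spo_list"]
    ]


example = json.loads(line)
triples = extract_triples(example)
print(triples)  # [('邪少兵王', '作者', '冰火未央')]
```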
import os
import sys
import json

from paddlenlp.transformers import ErnieForTokenClassification, ErnieTokenizer

# Load the 57 relation labels.
label_map_path = os.path.join('/home/aistudio/relation_extraction/data', "predicate2id.json")

if not (os.path.exists(label_map_path) and os.path.isfile(label_map_path)):
    sys.exit("{} does not exist or is not a file.".format(label_map_path))
with open(label_map_path, 'r', encoding='utf8') as fp:
    label_map = json.load(fp)

# Number of classes for multi-label classification, i.e. 2N + 2:
# (57 - 2 for I and O) * 2 (a head-entity and a tail-entity B tag per predicate) + 2 (I and O added back).
num_classes = (len(label_map.keys()) - 2) * 2 + 2

# This is a multi-label problem, so num_classes is passed to from_pretrained;
# a sigmoid is applied to the logits later.
model = ErnieForTokenClassification.from_pretrained("ernie-1.0", num_classes=num_classes)
tokenizer = ErnieTokenizer.from_pretrained("ernie-1.0")

# inputs = tokenizer(text="请输入测试样例", max_seq_len=20)
inputs = tokenizer(text="《邪少兵王》是冰火未央写的网络小说连载于旗峰天下", max_seq_len=20)
inputs
The leading 1 in input_ids is the CLS token, followed by the token IDs. token_type_ids is all zeros because the input is a single sentence.
{'input_ids': [1, 56, 1686, 332, 714, 338, 55, 10, 1161, 610, 556, 946, 519, 5, 305, 742, 96, 178, 538, 2],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
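The training and evaluation loops in this tutorial exclude positions whose ID is 0, 1, or 2 before computing the loss. A plain-Python sketch of that mask is shown below. The source only states that 1 is CLS; treating 0 as padding and 2 as the separator is an assumption made here, and `token_mask` is a hypothetical helper.

```python
# Assumed meaning of the special IDs: 0 = padding, 1 = CLS, 2 = separator.
SPECIAL_IDS = {0, 1, 2}


def token_mask(input_ids):
    """Return 1 for real tokens and 0 for special or padding positions."""
    return [0 if token_id in SPECIAL_IDS else 1 for token_id in input_ids]


# A short sequence padded to length 7.
ids = [1, 56, 1686, 332, 2, 0, 0]
mask = token_mask(ids)
print(mask)  # [0, 1, 1, 1, 0, 0, 0]
```

Only the three real tokens would contribute to the loss for this sequence.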
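Download the dataset from the competition website, unpack it into the data/ directory, and rename the files to train_data.json, dev_data.json, and test_data.json.
We can load a custom dataset by subclassing paddle.io.Dataset and implementing the two methods __getitem__ and __len__.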
import json
import os
from typing import Optional, List, Union, Dict

import numpy as np
import paddle
from tqdm import tqdm
from paddlenlp.transformers import ErnieTokenizer
from paddlenlp.utils.log import logger

from data_loader import parse_label, DataCollator, convert_example_to_feature
from extract_chinese_and_punct import ChineseAndPunctuationExtractor


class DuIEDataset(paddle.io.Dataset):
    def __init__(self, data, label_map, tokenizer, max_length=512, pad_to_max_length=False):
        super(DuIEDataset, self).__init__()

        self.data = data
        self.chn_punc_extractor = ChineseAndPunctuationExtractor()
        self.tokenizer = tokenizer
        self.max_seq_length = max_length
        self.pad_to_max_length = pad_to_max_length
        self.label_map = label_map

    def __len__(self):
        return len(self.data)

    def __getitem__(self, item):
        # Each entry of self.data is one JSON-encoded line of the data file.
        example = json.loads(self.data[item])
        input_feature = convert_example_to_feature(
            example, self.tokenizer, self.chn_punc_extractor,
            self.label_map, self.max_seq_length, self.pad_to_max_length)
        return {
            "input_ids": np.array(input_feature.input_ids, dtype="int64"),
            "seq_lens": np.array(input_feature.seq_len, dtype="int64"),
            "tok_to_orig_start_index":
            np.array(input_feature.tok_to_orig_start_index, dtype="int64"),
            "tok_to_orig_end_index":
            np.array(input_feature.tok_to_orig_end_index, dtype="int64"),
            # If model inputs are generated in `collate_fn`, delete the data type casting.
            "labels": np.array(input_feature.labels, dtype="float32"),
        }

    @classmethod
    def from_file(cls, file_path, tokenizer, max_length=512, pad_to_max_length=None):
        assert os.path.exists(file_path) and os.path.isfile(
            file_path), f"{file_path} does not exist or is not a file."
        # The label map is expected to sit next to the data file.
        label_map_path = os.path.join(
            os.path.dirname(file_path), "predicate2id.json")
        assert os.path.exists(label_map_path) and os.path.isfile(
            label_map_path
        ), f"{label_map_path} does not exist or is not a file."
        with open(label_map_path, 'r', encoding='utf8') as fp:
            label_map = json.load(fp)
        with open(file_path, "r", encoding="utf-8") as fp:
            data = fp.readlines()
            return cls(data, label_map, tokenizer, max_length, pad_to_max_length)
data_path = 'data'
batch_size = 32
max_seq_length = 128

train_file_path = os.path.join(data_path, 'train_data.json')
train_dataset = DuIEDataset.from_file(
    train_file_path, tokenizer, max_seq_length, True)
# print(len(train_dataset))
# print(train_dataset[0])

train_batch_sampler = paddle.io.BatchSampler(
    train_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
collator = DataCollator()
train_data_loader = paddle.io.DataLoader(
    dataset=train_dataset,
    batch_sampler=train_batch_sampler,
    collate_fn=collator)

# To avoid running out of memory, the smaller *_data.json trial files are used here;
# the full dev.json contains 170,000+ examples.
eval_file_path = os.path.join(data_path, 'dev_data.json')
test_dataset = DuIEDataset.from_file(
    eval_file_path, tokenizer, max_seq_length, True)
test_batch_sampler = paddle.io.BatchSampler(
    test_dataset, batch_size=batch_size, shuffle=False, drop_last=True)
test_data_loader = paddle.io.DataLoader(
    dataset=test_dataset,
    batch_sampler=test_batch_sampler,
    collate_fn=collator)
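We use binary cross-entropy on the logits (BCEWithLogitsLoss) as the loss function, since this is a multi-label classification problem, and paddle.optimizer.AdamW as the optimizer.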
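During training, model checkpoints are saved to the checkpoints folder in the current directory. The official evaluation script is also run during training and reports the P/R/F1 metrics.
The F1 score on the validation set can reach 69.42.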
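import paddle.nn as nn


# Multi-label classification, so use BCEWithLogitsLoss.
class BCELossForDuIE(nn.Layer):
    def __init__(self, ):
        super(BCELossForDuIE, self).__init__()
        # Keep the per-element loss so it can be masked before reduction.
        self.criterion = nn.BCEWithLogitsLoss(reduction='none')

    def forward(self, logits, labels, mask):
        loss = self.criterion(logits, labels)
        mask = paddle.cast(mask, 'float32')
        # Padding and special-token positions must not contribute to the loss; zero them out.
        loss = loss * mask.unsqueeze(-1)
        # Mean over classes, sum over tokens, then divide by the number of unmasked tokens.
        loss = paddle.sum(loss.mean(axis=2), axis=1) / paddle.sum(mask, axis=1)
        # Average over the batch.
        loss = loss.mean()
        return loss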
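To see what the masked loss above computes for a single sequence, here is a framework-free sketch in plain Python. It uses the standard numerically stable form of binary cross-entropy on logits and mirrors the reduction order of the class above (mean over classes, masked sum over tokens, divided by the number of unmasked tokens). The function names are hypothetical, and the final averaging over the batch is left out.

```python
import math


def bce_with_logits(logit, label):
    """Stable binary cross-entropy: max(x, 0) - x * y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def masked_sequence_loss(logits, labels, mask):
    """Loss for one sequence.

    logits, labels: [seq_len][num_classes]; mask: [seq_len] of 0/1.
    """
    total = 0.0
    for tok_logits, tok_labels, m in zip(logits, labels, mask):
        per_class = [bce_with_logits(x, y) for x, y in zip(tok_logits, tok_labels)]
        total += m * sum(per_class) / len(per_class)  # mean over classes, masked
    return total / sum(mask)  # average over unmasked tokens only


# Two tokens, two classes; the second token is masked out (e.g. padding).
logits = [[0.0, 0.0], [5.0, -5.0]]
labels = [[1.0, 0.0], [1.0, 0.0]]
mask = [1, 0]

loss = masked_sequence_loss(logits, labels, mask)
print(loss)  # log(2), since a logit of 0 means probability 0.5 for every class
```

The masked token has no effect on the result, however confident or wrong its logits are.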
import paddle.nn.functional as F

from utils import write_prediction_results, get_precision_recall_f1, decoding


@paddle.no_grad()
def evaluate(model, criterion, data_loader, file_path, mode):
    """
    mode eval:
    eval on development set and compute P/R/F1, called between training.
    mode predict:
    eval on development / test set, then write predictions to \
        predictions.json and predictions.json.zip \
        under /home/aistudio/relation_extraction/data dir for later submission or evaluation.
    """
    example_all = []
    with open(file_path, "r", encoding="utf-8") as fp:
        for line in fp:
            example_all.append(json.loads(line))
    # id2spo.json => {"predicate": ["empty", "empty", "注册资本"..}
    id2spo_path = os.path.join(os.path.dirname(file_path), "id2spo.json")
    with open(id2spo_path, 'r', encoding='utf8') as fp:
        id2spo = json.load(fp)

    model.eval()
    loss_all = 0
    eval_steps = 0
    formatted_outputs = []
    current_idx = 0
    for batch in tqdm(data_loader, total=len(data_loader)):
        eval_steps += 1
        input_ids, seq_len, tok_to_orig_start_index, tok_to_orig_end_index, labels = batch
        logits = model(input_ids=input_ids)
        # Exclude positions whose ID is 0, 1, or 2 from the loss.
        mask = (input_ids != 0).logical_and((input_ids != 1)).logical_and((input_ids != 2))
        loss = criterion(logits, labels, mask)
        loss_all += loss.numpy().item()
        # Independent per-label probabilities for multi-label decoding.
        probs = F.sigmoid(logits)
        logits_batch = probs.numpy()
        seq_len_batch = seq_len.numpy()
        tok_to_orig_start_index_batch = tok_to_orig_start_index.numpy()
        tok_to_orig_end_index_batch = tok_to_orig_end_index.numpy()
        formatted_outputs.extend(decoding(example_all[current_idx: current_idx+len(logits)],
                                          id2spo,
                                          logits_batch,
                                          seq_len_batch,
                                          tok_to_orig_start_index_batch,
                                          tok_to_orig_end_index_batch))
        current_idx = current_idx+len(logits)
    loss_avg = loss_all / eval_steps
    print("eval loss: %f" % (loss_avg))

    if mode == "predict":
        predict_file_path = os.path.join("/home/aistudio/relation_extraction/data", 'predictions.json')
    else:
        predict_file_path = os.path.join("/home/aistudio/relation_extraction/data", 'predict_eval.json')

    predict_zipfile_path = write_prediction_results(formatted_outputs,
                                                    predict_file_path)

    if mode == "eval":
        precision, recall, f1 = get_precision_recall_f1(file_path,
                                                        predict_zipfile_path)
        # Remove the temporary prediction files written for evaluation.
        os.system('rm {} {}'.format(predict_file_path, predict_zipfile_path))
        return precision, recall, f1
    elif mode != "predict":
        raise Exception("wrong mode for eval func")
from paddlenlp.transformers import LinearDecayWithWarmup

learning_rate = 2e-5
# Note: the training cell later resets num_train_epochs to 2, so the schedule
# computed here spans more steps than are actually run.
num_train_epochs = 5
warmup_ratio = 0.06

criterion = BCELossForDuIE()

# Defines learning rate strategy.
steps_by_epoch = len(train_data_loader)
num_training_steps = steps_by_epoch * num_train_epochs
lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_ratio)
# Apply weight decay to all parameters except bias and normalization parameters.
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    apply_decay_param_fun=lambda x: x in [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])])
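The general shape of a linear warmup followed by linear decay can be sketched as a plain function of the step number. This illustrates the idea only; it is not claimed to match the library's implementation detail for detail, and `linear_warmup_decay` and its parameters are names chosen here.

```python
def linear_warmup_decay(step, total_steps, warmup_ratio, base_lr):
    """Learning rate at `step`: ramp up linearly, then decay linearly to 0."""
    warmup_steps = int(warmup_ratio * total_steps)
    if step < warmup_steps:
        factor = step / max(1, warmup_steps)
    else:
        factor = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * factor


# Illustrative values: 100 steps, the first half spent warming up.
schedule = [linear_warmup_decay(s, 100, 0.5, 2e-5) for s in (0, 25, 50, 75, 100)]
print(schedule)
```

The rate starts at 0, reaches `base_lr` at the end of warmup, and returns to 0 at the last step, so the total step count passed to the scheduler should match the number of steps actually trained.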
# Directory where model parameters are saved
!mkdir checkpoints
import time
import paddle.nn.functional as F

# Starts training.
global_step = 0
logging_steps = 50
save_steps = 10000
num_train_epochs = 2
output_dir = 'checkpoints'
tic_train = time.time()
model.train()
for epoch in range(num_train_epochs):
    print("\n=====start training of %d epochs=====" % epoch)
    tic_epoch = time.time()
    for step, batch in enumerate(train_data_loader):
        input_ids, seq_lens, tok_to_orig_start_index, tok_to_orig_end_index, labels = batch
        logits = model(input_ids=input_ids)
        # Exclude positions whose ID is 0, 1, or 2 from the loss.
        mask = (input_ids != 0).logical_and((input_ids != 1)).logical_and(
            (input_ids != 2))
        loss = criterion(logits, labels, mask)
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.clear_gradients()
        loss_item = loss.numpy().item()

        if global_step % logging_steps == 0:
            print(
                "epoch: %d / %d, steps: %d / %d, loss: %f, speed: %.2f step/s"
                % (epoch, num_train_epochs, step, steps_by_epoch,
                   loss_item, logging_steps / (time.time() - tic_train)))
            tic_train = time.time()

        # Evaluate and save a checkpoint every save_steps steps.
        if global_step % save_steps == 0 and global_step != 0:
            print("\n=====start evaluating ckpt of %d steps=====" %
                  global_step)
            precision, recall, f1 = evaluate(
                model, criterion, test_data_loader, eval_file_path, "eval")
            print("precision: %.2f\t recall: %.2f\t f1: %.2f\t" %
                  (100 * precision, 100 * recall, 100 * f1))
            print("saving checkpoint model_%d.pdparams to %s " %
                  (global_step, output_dir))
            paddle.save(model.state_dict(),
                        os.path.join(output_dir,
                                     "model_%d.pdparams" % global_step))
            model.train()

        global_step += 1
    tic_epoch = time.time() - tic_epoch
    print("epoch time footprint: %d hour %d min %d sec" %
          (tic_epoch // 3600, (tic_epoch % 3600) // 60, tic_epoch % 60))

# Does final evaluation.
print("\n=====start evaluating last ckpt of %d steps=====" % global_step)
precision, recall, f1 = evaluate(model, criterion, test_data_loader,
                                 eval_file_path, "eval")
print("precision: %.2f\t recall: %.2f\t f1: %.2f\t" %
      (100 * precision, 100 * recall, 100 * f1))
paddle.save(model.state_dict(),
            os.path.join(output_dir, "model_%d.pdparams" % global_step))
print("\n=====training complete=====")
=====start training of 0 epochs=====
epoch: 0 / 2, steps: 0 / 5347, loss: 0.741724, speed: 66.93 step/s
epoch: 0 / 2, steps: 50 / 5347, loss: 0.733860, speed: 3.39 step/s
epoch: 0 / 2, steps: 100 / 5347, loss: 0.705046, speed: 3.35 step/s
epoch: 0 / 2, steps: 150 / 5347, loss: 0.633157, speed: 3.30 step/s
epoch: 0 / 2, steps: 200 / 5347, loss: 0.410678, speed: 3.24 step/s
epoch: 0 / 2, steps: 250 / 5347, loss: 0.302669, speed: 3.31 step/s
epoch: 0 / 2, steps: 300 / 5347, loss: 0.254647, speed: 3.29 step/s
epoch: 0 / 2, steps: 350 / 5347, loss: 0.224945, speed: 3.31 step/s
epoch: 0 / 2, steps: 400 / 5347, loss: 0.201895, speed: 3.26 step/s
epoch: 0 / 2, steps: 450 / 5347, loss: 0.179081, speed: 3.20 step/s
epoch: 0 / 2, steps: 500 / 5347, loss: 0.159897, speed: 3.30 step/s
......
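Step 4: Submit the prediction results
Load the model saved during training and run prediction with it.
NOTE: Make sure to set the path of the model parameters used for prediction.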
set -eux

export CUDA_VISIBLE_DEVICES=0
export BATCH_SIZE=8
export CKPT=./checkpoints/model_624.pdparams
export DATASET_FILE=./data/test_data.json

python run_duie.py \
    --do_predict \
    --init_checkpoint $CKPT \
    --predict_data_file $DATASET_FILE \
    --max_seq_length 512 \
    --batch_size $BATCH_SIZE
!bash predict.sh
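The predictions are saved to data/predictions.json and data/predictions.json.zip, in the same format as the original dataset files.
You can then use the official evaluation script to measure how the trained model performs on dev_data.json. For example: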
python re_official_evaluation.py --golden_file=dev_data.json --predict_file=predictions.json.zip [--alias_file alias_dict]
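The reported metrics are Precision, Recall, and F1. The alias file contains the valid entity aliases and is used in the final evaluation; it is not provided here.
Finally, run prediction on test_data.json and submit the results (the submission.zip file) on the Qianyan evaluation page.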
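The baseline uses ERNIE as its pretrained model. PaddleNLP provides many other pretrained models, such as BERT, RoBERTa, Electra, and XLNet.
See the pretrained model documentation.
For instance, you can switch to the Chinese RoBERTa large model to try to improve results; only the model and the tokenizer need to be replaced, and the rest of the pipeline stays the same.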
from paddlenlp.transformers import RobertaForTokenClassification, RobertaTokenizer

model = RobertaForTokenClassification.from_pretrained(
    "roberta-wwm-ext-large",
    num_classes=(len(label_map) - 2) * 2 + 2)
tokenizer = RobertaTokenizer.from_pretrained("roberta-wwm-ext-large")
Original article
https://aistudio.baidu.com/aistudio/projectdetail/1639963?sUid=2631487&shared=1&ts=1686032358184