forked from datawhalechina/competition-baseline
Commit 40019cc (1 parent: 1dd87af)
Showing 9 changed files with 2,199 additions and 4 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
## Chinese Idiom Cloze Challenge (中文成语填空挑战赛) | ||
|
||
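Chinese culture is broad, deep, and long-standing, and idioms (成语) are among its finest expressions. Most idioms consist of four characters and usually have an allusion or a source. Some can be understood from their literal wording, such as “小题大做” (making a fuss over a trifle) and “后来居上” (latecomers come out ahead). Others can only be understood by knowing their origin or allusion, such as “朝三暮四” and “杯弓蛇影”. | ||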
|
||
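Idioms are an important part of Chinese language study in primary and junior high school. How can the right idiom be chosen for a given sentence? This competition asks participants to build a model that understands Chinese idioms. | ||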
|
||
Competition link: http://challenge.xfyun.cn/topic/info?type=chinese-idioms&ch=dw-sq-1 | ||
|
||
| text | 曾经在越南这个全球第四大网游市场占据80%的金山游戏CEO邹涛对记者表示:“海外市场的本土网游企业也在崛起,这一点在越南等东南亚市场表现尤其明显,越南本土游戏公司[MASK][MASK][MASK][MASK],再加上更多的中国企业瞄准这一市场,竞争更加激烈 | | ||
| ------------- | ------------------------------------------------------------ | | ||
| candidate | 张王赵李, 海不波溢, 七男八婿, 异军突起 | | ||
| label | 异军突起 | | ||
| | | | ||
|
||
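The training set has 50,000 rows and the test set has 10,000 rows. Both are CSV files with columns separated by \t. The label field is empty in the test set and must be predicted by participants. | ||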
|
||
|
||
## Task | ||
|
||
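Given a Chinese sentence, participants must choose the most suitable idiom from the candidates based on the context, that is, fill the correct idiom into the masked position. |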
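To make the task concrete, here is a minimal sketch of how one row can be viewed as a four-way multiple-choice problem: each candidate idiom is substituted into the mask slot, giving four complete sentences for a model to score. The helper name `fill_candidates` and the shortened example sentence are illustrative assumptions, not part of the baseline, which passes the context and the candidates to the model as separate fields.

```python
# Hypothetical helper: expand one cloze row into one filled sentence per candidate.
MASK = "[MASK][MASK][MASK][MASK]"


def fill_candidates(text, candidates):
    """Return one filled-in sentence per candidate idiom."""
    if MASK not in text:
        raise ValueError("text has no four-character mask slot")
    # Replace only the first mask slot so the rest of the text stays untouched.
    return [text.replace(MASK, idiom, 1) for idiom in candidates]


# Shortened excerpt of the sample row shown in the README table.
text = "越南本土游戏公司[MASK][MASK][MASK][MASK],再加上更多的中国企业瞄准这一市场"
candidates = ["张王赵李", "海不波溢", "七男八婿", "异军突起"]

filled = fill_candidates(text, candidates)
for sentence in filled:
    print(sentence)
```

A scoring model would then assign one score per filled sentence and predict the candidate with the highest score.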
111 changes: 111 additions & 0 deletions
competition/科大讯飞AI开发者大赛2021/中文成语填空挑战赛/gen_train_test.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,111 @@ | ||
#!/usr/bin/env python | ||
# -*- coding:utf-8 -*- | ||
# author:quincy qiang | ||
# email:yanqiangmiffy@gmail.com | ||
# datetime:2021/8/16 11:20 | ||
# description:"do something" | ||
|
||
import re | ||
import pandas as pd | ||
from tqdm import tqdm | ||
|
||
train = pd.read_csv('data/train.csv', sep='\t') | ||
test = pd.read_csv('data/test.csv', sep='\t') | ||
|
||
print(train) | ||
print(test) | ||
|
||
|
||
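def process_text(text): | ||
    """Collapse runs of spaces into one and trim both ends.""" | ||
    return re.sub(' +', ' ', text).strip() | ||
|
||
|
||
def get_question(text): | ||
    """ | ||
    Return the sentence that contains [MASK][MASK][MASK][MASK] as the question. | ||
    :param text: full passage containing the masked idiom slot | ||
    :return: the sentence with the mask, or the full text if none is found | ||
    """ | ||
    # Split on sentence-ending punctuation (full-width and ASCII); the capture group keeps the delimiters | ||
    sentences = re.split('([。!?!?.])', text) | ||
    for sent in sentences: | ||
        if '[MASK][MASK][MASK][MASK]' in sent: | ||
            return sent | ||
    return text | ||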
|
||
|
||
cols = [ | ||
    "Unnamed: 0", | ||
    "video-id", | ||
    "fold-ind",  # q_id | ||
    "startphrase", | ||
    "sent1",  # content | ||
    "sent2",  # question | ||
    "gold-source", | ||
    "ending0", "ending1", "ending2", "ending3",  # choice | ||
    "label"] | ||
|
||
# ====================================================== | ||
# Build the training set | ||
# ====================================================== | ||
res = [] | ||
|
||
for idx, row in tqdm(train.iterrows(), total=len(train)): | ||
    q_id = f'train_{idx}' | ||
    content = row['text'] | ||
    content = process_text(content) | ||
    question = get_question(content) | ||
    # candidate is assumed to hold a list literal as a string; eval runs arbitrary code, so use it on trusted data only | ||
    modified_choices = eval(row['candidate']) | ||
    label = modified_choices.index(row['label']) | ||
    # Hard-coded column layout for the SWAG format | ||
    res.append(("", | ||
                "", | ||
                q_id, | ||
                "", | ||
                content, | ||
                question, | ||
                "", | ||
                modified_choices[0], | ||
                modified_choices[1], | ||
                modified_choices[2], | ||
                modified_choices[3], | ||
                label)) | ||
df = pd.DataFrame(res, columns=cols) | ||
|
||
# ====================================================== | ||
# Build the test set | ||
# ====================================================== | ||
res = [] | ||
print("test.shape", test.shape) | ||
for idx, row in tqdm(test.iterrows(), total=len(test)): | ||
    q_id = f'test_{idx}' | ||
    content = row['text'] | ||
    content = process_text(content) | ||
    question = get_question(content) | ||
    # candidate is assumed to hold a list literal as a string; eval runs arbitrary code, so use it on trusted data only | ||
    modified_choices = eval(row['candidate']) | ||
    # Hard-coded column layout for the SWAG format; 0 is a placeholder label because test labels are unknown | ||
    res.append(("", | ||
                "", | ||
                q_id, | ||
                "", | ||
                content, | ||
                question, | ||
                "", | ||
                modified_choices[0], | ||
                modified_choices[1], | ||
                modified_choices[2], | ||
                modified_choices[3], | ||
                0)) | ||
df_test = pd.DataFrame(res, columns=cols) | ||
|
||
print(df_test.shape) | ||
|
||
|
||
DEBUG = False | ||
if DEBUG: | ||
    df.iloc[:50].to_csv('data/new_train.csv', index=False) | ||
    df.iloc[-50:].to_csv('data/new_valid.csv', index=False) | ||
    df_test.iloc[:50].to_csv('data/new_test.csv', index=False) | ||
else: | ||
    # First 45,000 rows for training, the remaining rows for validation, so the two sets do not overlap | ||
    df.iloc[:45000].to_csv('data/new_train.csv', index=False) | ||
    df.iloc[45000:].to_csv('data/new_valid.csv', index=False) | ||
    df_test.to_csv('data/new_test.csv', index=False) |
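The script parses the `candidate` column with `eval`, which executes whatever expression the cell contains. A safer alternative, sketched below, is `ast.literal_eval`, which accepts only Python literals. This assumes each cell holds a list literal of four strings, which the README does not state explicitly; `parse_candidates` and `label_index` are hypothetical helper names, not functions from the commit.

```python
import ast


def parse_candidates(raw):
    """Parse a candidate cell into a list of four idioms without using eval."""
    # literal_eval only accepts literals (lists, strings, numbers, ...),
    # so arbitrary expressions such as function calls are rejected.
    choices = ast.literal_eval(raw)
    if not isinstance(choices, (list, tuple)) or len(choices) != 4:
        raise ValueError(f"expected four candidates, got: {raw!r}")
    return [str(choice) for choice in choices]


def label_index(raw_candidates, label):
    """Return the position of the gold idiom among the candidates."""
    return parse_candidates(raw_candidates).index(label)


# Assumed cell format: a Python list literal stored as a string.
raw = "['张王赵李', '海不波溢', '七男八婿', '异军突起']"
idx = label_index(raw, "异军突起")
print(idx)
```

In the loops above, `parse_candidates(row['candidate'])` could stand in for `eval(row['candidate'])` if the data matches the assumed format.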
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
#!/bin/bash | ||
|
||
python -u baseline.py \ | ||
    --model_name_or_path 'hfl/chinese-xlnet-base' \ | ||
    --do_train \ | ||
    --do_eval \ | ||
    --do_predict \ | ||
    --logging_steps=100 \ | ||
    --max_seq_length 200 \ | ||
    --train_file data/new_train.csv \ | ||
    --validation_file data/new_valid.csv \ | ||
    --test_file data/new_test.csv \ | ||
    --learning_rate 3e-5 \ | ||
    --num_train_epochs 2 \ | ||
    --output_dir 'models/xlnet' \ | ||
    --gradient_accumulation_steps 4 \ | ||
    --per_device_eval_batch_size 16 \ | ||
    --per_device_train_batch_size 16 \ | ||
    --overwrite_output |
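The launcher combines `--per_device_train_batch_size 16` with `--gradient_accumulation_steps 4`, so gradients from four batches are accumulated before each optimizer update. The arithmetic can be sketched as follows, assuming a single device and the 45,000 training rows written by `gen_train_test.py`. Both function names are illustrative, and the rounding used by the actual trainer in `baseline.py` may differ.

```python
import math


def effective_batch_size(per_device, accumulation_steps, num_devices=1):
    """Number of examples that contribute to one optimizer update."""
    return per_device * accumulation_steps * num_devices


def approx_update_steps(num_examples, per_device, accumulation_steps,
                        epochs, num_devices=1):
    """Rough count of optimizer updates over the whole run (assumed rounding)."""
    batches_per_epoch = math.ceil(num_examples / (per_device * num_devices))
    updates_per_epoch = max(batches_per_epoch // accumulation_steps, 1)
    return updates_per_epoch * epochs


# Values taken from the launcher flags; one device is an assumption.
print(effective_batch_size(16, 4))
print(approx_update_steps(45000, 16, 4, epochs=2))
```

Under these assumptions each update sees 64 examples, which is useful context when comparing the `3e-5` learning rate against runs that use a different batch size.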
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
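## Background | ||
For mobile device manufacturers, obtaining demographic attributes of current phone users is very difficult. Accurately predicting these attributes from a user's phone and everyday app usage preferences is the basis for improving personalized experiences and building precise user profiles. | ||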
|
||
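Note that the data for this competition was collected with the full consent of individual users and has been appropriately anonymized to protect privacy. For confidentiality reasons, we do not provide details on how the gender and age data were obtained. | ||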
|
||
Competition link: http://challenge.xfyun.cn/topic/info?type=mobile-devices&ch=dw-sq-1 | ||
|
||
## Task | ||
|
||
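This competition has two tasks: predicting the gender and the age of each mobile device (device_id). These correspond to a binary classification problem and a regression problem, and the scores of the two parts are combined for the final ranking. |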