Add baselines for the iFLYTEK (科大讯飞) competitions

Torres9999 · Aug 18, 2021 · 40019cc
1 parent 1dd87af
commit 40019cc
Showing 9 changed files with 2,199 additions and 4 deletions.
diff --git a/.DS_Store b/.DS_Store
diff --git a/competition/科大讯飞AI开发者大赛2021/.DS_Store b/competition/科大讯飞AI开发者大赛2021/.DS_Store
diff --git a/competition/科大讯飞AI开发者大赛2021/READMD.md b/competition/科大讯飞AI开发者大赛2021/READMD.md
diff --git a/competition/科大讯飞AI开发者大赛2021/中文成语填空挑战赛/README.md b/competition/科大讯飞AI开发者大赛2021/中文成语填空挑战赛/README.md
@@ -0,0 +1,20 @@
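+## Chinese Idiom Cloze Challenge (中文成语填空挑战赛)
+
+Chinese culture is broad, deep, and long-standing, and idioms (成语) are among its finest expressions. Most idioms consist of four characters and usually trace back to a classical story or source. Some can be understood from the literal wording, such as “小题大做” (making a fuss over a trifle) and “后来居上” (the latecomer takes the lead). Others only make sense once you know the story behind them, such as “朝三暮四” (literally "three in the morning, four in the evening": being fickle) and “杯弓蛇影” (mistaking the reflection of a bow in a cup for a snake: needless suspicion).
+
+Idioms are an important part of the Chinese-language curriculum in primary and junior secondary school. How do you choose the right idiom for a sentence? This task asks participants to build models that understand Chinese idioms.
+
+Competition link: http://challenge.xfyun.cn/topic/info?type=chinese-idioms&ch=dw-sq-1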
+
+| text      | 曾经在越南这个全球第四大网游市场占据80%的金山游戏CEO邹涛对记者表示：“海外市场的本土网游企业也在崛起，这一点在越南等东南亚市场表现尤其明显，越南本土游戏公司[MASK][MASK][MASK][MASK]，再加上更多的中国企业瞄准这一市场，竞争更加激烈 |
+| ------------- | ------------------------------------------------------------ |
+| candidate | 张王赵李, 海不波溢, 七男八婿, 异军突起                       |
+| label     | 异军突起                                                     |
+|               |                                                              |
+
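+The training set has 50,000 rows and the test set has 10,000 rows. Both are CSV files whose columns are separated by tabs (`\t`). The `label` field is empty in the test set and must be predicted by participants.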
+
+
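+## Task
+
+Given a Chinese sentence in which an idiom has been masked out, participants must use the surrounding context to choose the most suitable idiom from the candidates, that is, fill the blank with the idiom that best fits the sentence.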
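One way to read the task, and the one the SWAG-style columns in `gen_train_test.py` suggest, is as four-way multiple choice: each candidate idiom is a possible completion of the masked sentence. The sketch below is illustrative only and is not part of the commit; `fill_mask` is a hypothetical helper, and the shortened sentence is adapted from the sample row in the table above.

```python
MASK = "[MASK][MASK][MASK][MASK]"


def fill_mask(text, candidates):
    """Return one filled-in sentence per candidate idiom (hypothetical helper)."""
    return [text.replace(MASK, idiom, 1) for idiom in candidates]


# Shortened version of the sample row shown in the README table.
text = "越南本土游戏公司[MASK][MASK][MASK][MASK]，竞争更加激烈"
candidates = ["张王赵李", "海不波溢", "七男八婿", "异军突起"]
label = "异军突起"

filled = fill_mask(text, candidates)
label_index = candidates.index(label)  # class index a multiple-choice model would be trained on

print(label_index)          # 3
print(filled[label_index])  # the sentence with the correct idiom in place
```

A multiple-choice model scores each of the four completions and is trained to pick `label_index`.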
diff --git a/competition/科大讯飞AI开发者大赛2021/中文成语填空挑战赛/baseline.py b/competition/科大讯飞AI开发者大赛2021/中文成语填空挑战赛/baseline.py
diff --git a/competition/科大讯飞AI开发者大赛2021/中文成语填空挑战赛/gen_train_test.py b/competition/科大讯飞AI开发者大赛2021/中文成语填空挑战赛/gen_train_test.py
@@ -0,0 +1,111 @@
+#!/usr/bin/env python
+# -*- coding:utf-8 -*-
+# author:quincy qiang
+# email:yanqiangmiffy@gmail.com
+# datetime:2021/8/16 11:20
+# description: convert the competition train/test CSVs into SWAG-style multiple-choice CSVs
+
+import re
+import pandas as pd
+from tqdm import tqdm
+
+train = pd.read_csv('data/train.csv', sep='\t')
+test = pd.read_csv('data/test.csv', sep='\t')
+
+print(train)
+print(test)
+
+
+def process_text(text):
+    return re.sub(' +', ' ', text).strip()
+
+
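+def get_question(text):
+    """
+    Return the sentence that contains [MASK][MASK][MASK][MASK].
+    :param text: the full passage
+    :return: the sentence holding the mask, or the whole text if no piece contains it
+    """
+    sentences = re.split(r'(。|！|!|\.|？|\?)', text)  # the capture group keeps the delimiters as separate items
+    for sent in sentences:
+        if '[MASK][MASK][MASK][MASK]' in sent:
+            return sent
+    return text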
+
+
+cols = [
+    "Unnamed: 0",
+    "video-id",
+    "fold-ind",  # q_id
+    "startphrase",
+    "sent1",  # content
+    "sent2",  # question
+    "gold-source",
+    "ending0", "ending1", "ending2", "ending3",  # choice
+    "label"]
+
+# ======================================================
+# Build the training set
+# ======================================================
+res = []
+
+for idx, row in tqdm(train.iterrows(), total=len(train)):
+    q_id = f'train_{idx}'
+    content = row['text']
+    content = process_text(content)
+    question = get_question(content)
+    modified_choices = eval(row['candidate'])  # parses the stored list literal; ast.literal_eval would be a safer parser
+    label = modified_choices.index(row['label'])
+    ## Hard-code for swag format!
+    res.append(("",
+                "",
+                q_id,
+                "",
+                content,
+                question,
+                "",
+                modified_choices[0],
+                modified_choices[1],
+                modified_choices[2],
+                modified_choices[3],
+                label))
+df = pd.DataFrame(res, columns=cols)
+
+# ======================================================
+# Build the test set
+# ======================================================
+res = []
+print("test.shape", test.shape)
+for idx, row in tqdm(test.iterrows(), total=len(test)):
+    q_id = f'test_{idx}'
+    content = row['text']
+    content = process_text(content)
+    question = get_question(content)
+    modified_choices = eval(row['candidate'])  # parses the stored list literal; ast.literal_eval would be a safer parser
+    ## Hard-code for swag format!
+    res.append(("",
+                "",
+                q_id,
+                "",
+                content,
+                question,
+                "",
+                modified_choices[0],
+                modified_choices[1],
+                modified_choices[2],
+                modified_choices[3],
+                0))  # placeholder label: the test set has no labels
+df_test = pd.DataFrame(res, columns=cols)
+
+print(df_test.shape)
+
+
+DEBUG = False
+if DEBUG:
+    df.iloc[:50].to_csv('data/new_train.csv', index=False)
+    df.iloc[-50:].to_csv('data/new_valid.csv', index=False)
+    df_test.iloc[:50].to_csv('data/new_test.csv', index=False)
+else:
+    df.iloc[:45000].to_csv('data/new_train.csv', index=False)
+    df.iloc[45000:].to_csv('data/new_valid.csv', index=False)  # hold out the rows after the training cut-off so validation does not overlap training
+    df_test.to_csv('data/new_test.csv', index=False)
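Two behaviours of this script are easy to check in isolation: which piece of a passage `get_question` returns, and whether a pair of train/validation slices share rows. The sketch below is not part of the commit. It repeats the `get_question` logic on a made-up passage, and `shared_rows` is a hypothetical helper that uses a plain `range` in place of the DataFrame and assumes the 50,000-row training set described in the README.

```python
import re

MASK = "[MASK][MASK][MASK][MASK]"


def get_question(text):
    # Same logic as the script: split on sentence-ending punctuation
    # and return the piece that contains the mask.
    for part in re.split(r'(。|！|!|\.|？|\?)', text):
        if MASK in part:
            return part
    return text


# Made-up passage with three sentences; only the second holds the mask.
passage = "市场很大。本土公司[MASK][MASK][MASK][MASK]，竞争激烈。前景如何？"
question = get_question(passage)
print(question)  # the middle sentence, without its closing punctuation


def shared_rows(n_rows, train_slice, valid_slice):
    """Count row positions that fall in both slices (hypothetical helper)."""
    rows = range(n_rows)
    return len(set(rows[train_slice]) & set(rows[valid_slice]))


# Cutting both sets at the same position gives disjoint splits.
print(shared_rows(50_000, slice(None, 45_000), slice(45_000, None)))  # 0
# A validation slice that starts before the training cut-off overlaps it.
print(shared_rows(50_000, slice(None, 45_000), slice(5_000, None)))   # 40000
```

Because the pattern also splits on an ASCII `.`, a passage containing a decimal such as `3.5` would be cut at the decimal point; whether that matters depends on the data.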
diff --git a/competition/科大讯飞AI开发者大赛2021/中文成语填空挑战赛/run.sh b/competition/科大讯飞AI开发者大赛2021/中文成语填空挑战赛/run.sh
@@ -0,0 +1,19 @@
+#!/bin/bash
+
+python -u baseline.py \
+  --model_name_or_path 'hfl/chinese-xlnet-base' \
+  --do_train \
+  --do_eval \
+  --do_predict \
+  --logging_steps 100 \
+  --max_seq_length 200 \
+  --train_file data/new_train.csv \
+  --validation_file data/new_valid.csv \
+  --test_file data/new_test.csv \
+  --learning_rate 3e-5 \
+  --num_train_epochs 2 \
+  --output_dir 'models/xlnet' \
+  --gradient_accumulation_steps 4 \
+  --per_device_eval_batch_size 16 \
+  --per_device_train_batch_size 16 \
+  --overwrite_output_dir
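The batch-size flags above combine multiplicatively, so the number of examples behind each optimizer update is larger than the per-device value. The arithmetic below is a rough estimate rather than output from the training run: it assumes a single device and the 45,000 training rows written when `DEBUG` is off, and the exact step count depends on how the training loop handles the last partial accumulation window.

```python
import math

train_rows = 45_000    # rows in data/new_train.csv when DEBUG is False
per_device_batch = 16  # --per_device_train_batch_size
grad_accum = 4         # --gradient_accumulation_steps
epochs = 2             # --num_train_epochs
num_devices = 1        # assumption: a single GPU

# Examples that contribute to one optimizer update.
effective_batch = per_device_batch * grad_accum * num_devices

# Approximate number of optimizer updates.
steps_per_epoch = math.ceil(train_rows / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch)  # 64
print(steps_per_epoch)  # 704
print(total_steps)      # 1408
```

With `--logging_steps` set to 100, that comes to roughly 14 log lines over the whole run.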
diff --git a/competition/科大讯飞AI开发者大赛2021/移动设备用户年龄和性别预测/README.md b/competition/科大讯飞AI开发者大赛2021/移动设备用户年龄和性别预测/README.md
@@ -0,0 +1,10 @@
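+## Background
+For mobile device manufacturers, obtaining demographic information about current phone users is very difficult. Accurately predicting a user's demographic attributes from their phone and the apps they use day to day is the foundation for improving personalization and building precise user profiles.
+
+Note that the data for this competition was collected with the users' full knowledge and consent and has been anonymized to protect privacy. For confidentiality reasons, we do not provide details on how the gender and age data were obtained.
+
+Competition link: http://challenge.xfyun.cn/topic/info?type=mobile-devices&ch=dw-sq-1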
+
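+## Task
+
+This competition has two tasks: predicting the gender and the age of each mobile device (`device_id`). One is a binary classification problem and the other is a regression problem, and the scores of the two parts are combined for the final ranking.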