Add files via upload

datawhalechina · Jun 19, 2024 · 5a2e39e
1 parent 26c8310
commit 5a2e39e
Showing 1 changed file with 383 additions and 0 deletions.
diff --git a/competition/科大讯飞AI开发者大赛2024/基于术语词典干预的机器翻译挑战赛_baseline.ipynb b/competition/科大讯飞AI开发者大赛2024/基于术语词典干预的机器翻译挑战赛_baseline.ipynb
@@ -0,0 +1,383 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "536d9385-4afa-43d7-9f5d-8b94a7c9dfee",
+   "metadata": {},
+   "source": [
+    "<center><h1><a href=\"https://challenge.xfyun.cn/topic/info?type=machine-translation-2024&option=ssgy&ch=dw24_AtTCK9\">Machine Translation Challenge with Terminology Dictionary Intervention</a></h1></center>\n",
+    "\n",
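+    "# 1. Background\n",
+    "\n",
+    "Neural machine translation has made major breakthroughs, but in specific domains and industries translation quality is still not ideal because machine translation struggles to keep terminology consistent. Where terms, person names, place names, and similar items are translated inaccurately, a terminology dictionary can be used to correct the output, which avoids confusion and ambiguity and improves translation quality as far as possible.\n",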
+    "\n",
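+    "# 2. Task\n",
+    "\n",
+    "The challenge targets machine translation with English as the source language and Chinese as the target language. In addition to English-to-Chinese bilingual data, the competition provides an English-Chinese terminology dictionary. Teams are expected to build and train a multilingual machine translation model on the provided training samples, then produce final translations for the test set with the help of the terminology dictionary. The data includes:\n",
+    "\n",
+    "· Training set: bilingual data, more than 140,000 Chinese-English sentence pairs\n",
+    "\n",
+    "· Development set: 1,000 English-Chinese sentence pairs\n",
+    "\n",
+    "· Test set: 1,000 English-Chinese sentence pairs\n",
+    "\n",
+    "· Terminology dictionary: 2,226 English-Chinese entries\n",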
+    "\n",
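+    "# 3. Evaluation rules\n",
+    "\n",
+    "## 1. Data description\n",
+    "\n",
+    "All files are UTF-8 encoded. The training set, development set, test set, and terminology dictionary released by the organizers are all plain-text files in the formats shown below.\n",
+    "\n",
+    "The training set is bilingual data with one sentence pair per line, in the format shown in Figure 1.\n",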
+    "\n",
+    "![img](https://openres.xfyun.cn/xfyundoc/2024-05-14/d13583c7-92f7-4f71-b442-a6e7dd39b522/1715665891662/9.png)\n",
+    "\n",
+    "Figure 1: Training set format\n",
+    "\n",
+    "The terminology dictionary format is shown in Figure 2.\n",
+    "\n",
+    "![img](https://openres.xfyun.cn/xfyundoc/2024-05-14/f2fd890f-6ed0-4978-bf0a-10e6ebba81ba/1715665926068/10.png)\n",
+    "\n",
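+    "Figure 2: Terminology dictionary format\n",
+    "\n",
+    "## 2. Evaluation metric\n",
+    "\n",
+    "Test-set translation files submitted by teams are scored with the automatic metric BLEU-4, computed with the open-source sacrebleu tool.\n",
+    "\n",
+    "## 3. Evaluation and ranking\n",
+    "\n",
+    "1) The data is provided for download; participants develop and debug their algorithms locally and submit results on the competition page.\n",
+    "\n",
+    "2) The leaderboard is sorted by score from high to low and ranks each team by its best score to date.\n",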
+    "\n",
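+    "# 4. Submission requirements\n",
+    "\n",
+    "1. File format: txt, encoded as UTF-8\n",
+    "\n",
+    "2. File size: no requirement\n",
+    "\n",
+    "3. Submission limit: at most 3 submissions per team per day\n",
+    "\n",
+    "4. File details:\n",
+    "\n",
+    "1) For the submission format, see the sample in Figure 3 and the example.txt file\n",
+    "2) Teams that reach the final must submit a technical document covering an overview of the system's main techniques, a description of important parameters, and a description of any external technology (open-source code or software)\n",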
+    "\n",
+    "![img](https://openres.xfyun.cn/xfyundoc/2024-05-14/92c18ae0-a72f-47a8-9ebe-a87ac5577427/1715666008957/11.png)\n",
+    "\n",
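+    "Figure 3: Translation result submission format\n",
+    "\n",
+    "# 5. Schedule and rules\n",
+    "\n",
+    "This competition has a single round.\n",
+    "\n",
+    "## [Schedule]\n",
+    "\n",
+    "June 9 to August 9\n",
+    "\n",
+    "1. June 9, 10:00: the training, development, and test sets are released (and the leaderboard opens)\n",
+    "\n",
+    "2. The submission deadline is August 9, 17:00; rankings are announced on August 16, 10:00\n",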
+    "\n",
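+    "## [On-site defense]\n",
+    "\n",
+    "1. The final top three teams will be invited to the iFLYTEK AI Developer Competition finals to defend their work on site\n",
+    "\n",
+    "2. The defense consists of a 10-minute presentation followed by a 5-minute Q&A\n",
+    "\n",
+    "3. The final score combines the submission score and the defense score (submission 70%, on-site defense 30%)\n",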
+    "\n",
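+    "# 6. Prizes\n",
+    "\n",
+    "This competition awards one first prize, one second prize, and one third prize, as detailed below:\n",
+    "\n",
+    "## [Awards]\n",
+    "\n",
+    "1. Certificates for the top 3 teams\n",
+    "2. Track prize money: 5,000 yuan for first place, 3,000 yuan for second place, 2,000 yuan for third place\n",
+    "\n",
+    "## [Resources]\n",
+    "\n",
+    "1. iFLYTEK Open Platform premium AI capability personal resource package\n",
+    "2. iFLYTEK AI full-chain startup support resources\n",
+    "3. iFLYTEK fast-track internship/employment channel\n",
+    "\n",
+    "Notes:\n",
+    "\n",
+    "1. Participants are encouraged to send articles about their competition experience, technical write-ups, or experience with related technologies and products to the organizing committee (AICompetition@iflytek.com) for a chance to receive competition merchandise;\n",
+    "2. iFLYTEK reserves the right to interpret the competition rules and prize distribution; all prize amounts above are pre-tax, and the organizer will withhold and pay individual income tax on the winners' behalf."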
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "02b2d45a-d89d-4bdc-9752-f9dafd9ff48e",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/home/lyz/anaconda3/envs/py311/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
+      "  warnings.warn(\n"
+     ]
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "7d171b49156c4fd09bb2e1e05dee41e9",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "config.json:   0%|          | 0.00/371 [00:00<?, ?B/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/home/lyz/anaconda3/envs/py311/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
+      "  warnings.warn(\n"
+     ]
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "a5529ba7af574ab5850803dc7f33c8c1",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "model.safetensors:   0%|          | 0.00/988M [00:00<?, ?B/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "b39f7bb4cb4e42848579cf34f9dcfdb5",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "generation_config.json:   0%|          | 0.00/242 [00:00<?, ?B/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/home/lyz/anaconda3/envs/py311/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
+      "  warnings.warn(\n"
+     ]
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "782900272b5e4c8493fa4fc32bc1bd59",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "tokenizer_config.json:   0%|          | 0.00/474 [00:00<?, ?B/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "ae6be5aa1ca144a39904e3ab9fa1e0c9",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "vocab.json: 0.00B [00:00, ?B/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "8f3bc53dd4f64136a7459d506e2786a7",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "merges.txt: 0.00B [00:00, ?B/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "73a0ecec75294bcbbe61dd1ba386a51d",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "tokenizer.json: 0.00B [00:00, ?B/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n"
+     ]
+    }
+   ],
+   "source": [
+    "import os\n",
+    "os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'\n",
+    "\n",
+    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
+    "device = \"cuda\" # the device to load the model onto\n",
+    "\n",
+    "# Option used here: a pretrained open-source LLM\n",
+    "# (alternative: train a model from scratch on the competition dataset)\n",
+    "model = AutoModelForCausalLM.from_pretrained(\n",
+    "    \"Qwen/Qwen2-0.5B-Instruct\",\n",
+    "    torch_dtype=\"auto\",\n",
+    "    device_map=\"auto\"\n",
+    ")\n",
+    "tokenizer = AutoTokenizer.from_pretrained(\"Qwen/Qwen2-0.5B-Instruct\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 28,
+   "id": "1a305540-17bc-4007-82b2-92a686bf1cde",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "dc = pd.read_csv('./dataset/en-zh.dic', sep='\\t', header=None).set_index(0).to_dict()[1]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "3d0e2ec7-3d26-49c6-aeac-6215635183d5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from tqdm import tqdm\n",
+    "lines = open('./dataset/test_en.txt', encoding='utf-8').readlines()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "07161730-fc7c-4de5-9225-16c2f9b6b494",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      " 85%|████████▍ | 846/1000 [09:33<01:41,  1.51it/s]"
+     ]
+    }
+   ],
+   "source": [
+    "result = []\n",
+    "for line in tqdm(lines):\n",
+    "    sp_words = [x for x in line.lower().split() if x in dc]\n",
+    "    sp_words_meaning = [dc[x] for x in sp_words]\n",
+    "\n",
+    "    # Term hints of the form '<term> 翻译为 <translation>; ' ('翻译为' means 'translates to');\n",
+    "    # the string stays empty when no dictionary term matches\n",
+    "    sp_prompt = ''.join(\n",
+    "        f'{x} 翻译为 {y}; ' for x, y in zip(sp_words, sp_words_meaning))\n",
+    "    # Main task (system prompt: translate English to Chinese, output only the translation, keep the term translations)\n",
+    "    messages = [\n",
+    "        {\"role\": \"system\", \"content\": \"将英文翻译为中文，不要有其他输出，直接输出翻译后的文本。保留特殊单词的翻译。\"},\n",
+    "    ]\n",
+    "\n",
+    "    # Hand-written dictionary rules, sent only when at least one term matched\n",
+    "    if len(sp_prompt) > 0:\n",
+    "        messages.append({\"role\": \"user\", \"content\": sp_prompt})\n",
+    "    messages.append({\"role\": \"user\", \"content\": f\"待翻译文本（从英文翻译为中文）：{line.strip()}\"})\n",
+    "    \n",
+    "    text = tokenizer.apply_chat_template(\n",
+    "        messages,\n",
+    "        tokenize=False,\n",
+    "        add_generation_prompt=True\n",
+    "    )\n",
+    "    model_inputs = tokenizer([text], return_tensors=\"pt\").to(model.device)\n",
+    "    \n",
+    "    generated_ids = model.generate(\n",
+    "        **model_inputs,\n",
+    "        max_new_tokens=512\n",
+    "    )\n",
+    "    generated_ids = [\n",
+    "        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)\n",
+    "    ]\n",
+    "    \n",
+    "    result_line = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]\n",
+    "    result.append(result_line)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 79,
+   "id": "54170caf-6c19-48b9-87fe-5d25f037ff96",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with open('submit1.txt', 'w', encoding='utf-8') as up:\n",
+    "    for line in result:\n",
+    "        line = line.strip().replace('\\n', '')\n",
+    "        up.write(line + '\\n')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d79ce76c-7407-4eda-a3e5-d8aaaa004154",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "py3.11",
+   "language": "python",
+   "name": "py3.11"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.8"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
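
Note on the term lookup in the baseline loop: `line.lower().split()` only matches single whitespace-separated tokens, so a term followed by punctuation (for example `network,`) or a term made of several words cannot hit the dictionary. The sketch below is one possible way to address that, not part of the committed notebook. It assumes the dictionary has already been loaded into a plain `dict` such as `dc`; the helper names `find_terms` and `build_term_prompt`, the `max_len` parameter, and the toy dictionary are hypothetical and only for illustration.

```python
import re


def find_terms(sentence, term_dict, max_len=4):
    """Return (term, translation) pairs found in sentence, preferring longer matches."""
    # Lowercase the sentence and keep only word-like tokens, dropping punctuation.
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    # Lowercase the dictionary keys so the lookup is case-insensitive.
    lookup = {str(k).lower(): v for k, v in term_dict.items()}
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest window first (up to max_len tokens), then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in lookup:
                found.append((candidate, lookup[candidate]))
                i += n
                break
        else:
            # No dictionary term starts at this token; move on.
            i += 1
    return found


def build_term_prompt(sentence, term_dict):
    """Build a term-hint string in the baseline's '<term> 翻译为 <translation>; ' style."""
    pairs = dict(find_terms(sentence, term_dict))  # dict() drops repeated terms
    return "".join(f"{en} 翻译为 {zh}; " for en, zh in pairs.items())


# Toy dictionary for illustration only; it is not taken from the competition data.
toy_dict = {"neural network": "神经网络", "gradient": "梯度"}
sample = "The neural network, trained with Gradient descent, converged."
print(find_terms(sample, toy_dict))
print(build_term_prompt(sample, toy_dict))
```

In the loop, `sp_prompt = build_term_prompt(line, dc)` could stand in for the two list comprehensions. Whether this helps the score depends on how many dictionary entries are multi-word or appear next to punctuation, which has not been measured here.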