Merge pull request #431 from shibing624/llm
Llm
shibing624 authored Nov 7, 2023
2 parents 751da4e + 0ba3b5d commit 193eb96
Showing 196 changed files with 35,939 additions and 283,559 deletions.
414 changes: 110 additions & 304 deletions README.md

Large diffs are not rendered by default.

20 changes: 20 additions & 0 deletions docs/correction_solution.md
@@ -0,0 +1,20 @@
# Solution

### Rule-based approach
A language model detects the positions of misspelled characters, which are then corrected using pinyin phonetic-similarity features, stroke/Wubi edit-distance features, and language-model perplexity.

1. Chinese text correction runs in two steps: error detection first, then error correction.
2. For error detection, the sentence is segmented with the jieba Chinese tokenizer. Because the sentence contains typos, the segmentation often splits incorrectly, so suspected errors are detected at both character and word granularity; the results from the two granularities are merged into a candidate set of suspected error positions.
3. For error correction, each suspected error position is traversed, the token there is replaced with entries from phonetically similar and visually similar dictionaries, a language model then scores the perplexity of each rewritten sentence, and all candidates are compared and ranked to select the best correction (a minimal sketch of this loop follows the list).
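
The detect-then-rank loop above can be sketched in a few lines. This is a minimal sketch under stated assumptions: `CONFUSION`, `GOOD_WORDS`, and `lm_perplexity` are hypothetical stand-ins for pycorrector's pinyin/Wubi similarity dictionaries and its trained n-gram language model, not the project's actual implementation.

```python
import jieba

# Hypothetical confusion dictionary: suspect character -> similar-sounding/looking candidates.
CONFUSION = {"新": ["心", "欣"]}
# Toy vocabulary standing in for a trained n-gram LM (e.g. KenLM).
GOOD_WORDS = {"今天", "心情", "很好"}


def lm_perplexity(sentence: str) -> float:
    # Placeholder scorer: fewer recognizable words -> higher pseudo-perplexity.
    hits = sum(1 for w in jieba.cut(sentence) if w in GOOD_WORDS)
    return 1.0 / (1.0 + hits)


def correct(sentence: str) -> str:
    tokens = list(jieba.cut(sentence))  # word-granularity segmentation
    best, best_ppl = sentence, lm_perplexity(sentence)
    for i, tok in enumerate(tokens):
        for ch in tok:  # character-granularity pass within each segment
            for cand in CONFUSION.get(ch, []):
                rewritten = "".join(tokens[:i]) + tok.replace(ch, cand) + "".join(tokens[i + 1:])
                ppl = lm_perplexity(rewritten)
                if ppl < best_ppl:  # keep the lowest-perplexity rewrite
                    best, best_ppl = rewritten, ppl
    return best


print(correct("今天新情很好"))  # -> 今天心情很好
```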

### Deep-model approach

1. An end-to-end deep model avoids manual feature engineering and reduces manual effort. RNN sequence models fit text tasks well; an RNN with attention took first place in an English text-correction competition, demonstrating that the approach works in practice.
2. A CRF computes the conditional probability of the globally optimal output sequence, so a given error type is detected in the context of the whole sentence. Alibaba took first place in the 2016 Chinese grammatical error correction shared task with this approach, demonstrating that it works in practice.
3. Seq2Seq models use an encoder-decoder architecture for sequence transduction and are currently among the most widely used and best-performing models for such tasks (machine translation, dialogue generation, text summarization, image captioning).
4. Pretrained models such as BERT, ELECTRA, ERNIE, and MacBERT bring powerful language-representation capacity that has transformed NLP; a language model fitted on massive training data is hard to match. Thanks to the MASK-based pretraining objective, a pretrained model can be adapted for correction with little modification, and with fine-tuning it easily reaches state-of-the-art results (see the masked-LM sketch after this list).
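
Item 4's masked-LM idea can be tried directly with Hugging Face `transformers`. A minimal sketch, assuming the `hfl/chinese-macbert-base` checkpoint and a detection step that has already flagged the misspelled character; this illustrates the idea, not pycorrector's exact pipeline:

```python
from transformers import pipeline

# Load a Chinese masked-LM checkpoint (assumed here; any BERT-style model works).
fill = pipeline("fill-mask", model="hfl/chinese-macbert-base")

# Suppose detection flagged "新" in "今天新情很好"; mask it and let the LM vote.
for pred in fill("今天[MASK]情很好", top_k=3):
    print(pred["token_str"], round(pred["score"], 4))  # "心" should rank highly
```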

PS:

- [Author's live-stream talk on pycorrector internals](https://github.com/shibing624/pycorrector/wiki/pycorrector%E6%BA%90%E7%A0%81%E8%A7%A3%E8%AF%BB-%E7%9B%B4%E6%92%AD%E5%88%86%E4%BA%AB)
- [A community source-code walkthrough](https://zhuanlan.zhihu.com/p/138981644)
28 changes: 0 additions & 28 deletions examples/base_demo.py

This file was deleted.

24 changes: 0 additions & 24 deletions examples/bert_demo.py

This file was deleted.

49 changes: 49 additions & 0 deletions examples/data/grammar/convert_dataset.py
@@ -0,0 +1,49 @@
"""
Convert alpaca dataset into sharegpt format.
Usage: python convert_alpaca.py --in_file alpaca_data.json --out_file alpaca_data_sharegpt.json
"""

import argparse

from datasets import load_dataset

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--in_file", type=str, required=True)
    parser.add_argument("--out_file", type=str, required=True)
    parser.add_argument("--data_type", type=str, default='alpaca')
    args = parser.parse_args()
    print(args)
    data_files = {"train": args.in_file}
    raw_datasets = load_dataset('json', data_files=data_files)
    ds = raw_datasets['train']

    system_prompt = "对这个句子语法纠错"

    def process_alpaca(examples):
        """Map alpaca records to single-turn sharegpt conversations."""
        convs = []
        for instruction, inp, output in zip(examples['instruction'], examples['input'], examples['output']):
            # Records with a non-empty input: replace the instruction with the
            # grammar-correction prompt followed by the sentence to correct.
            if len(inp.strip()) > 1:
                instruction = system_prompt + '\n\n' + inp
            q = instruction
            a = output
            convs.append([
                {"from": "human", "value": q},
                {"from": "gpt", "value": a}
            ])
        return {"conversations": convs}

    if args.data_type in ['alpaca']:
        ds = ds.map(process_alpaca, batched=True, remove_columns=ds.column_names, desc="Running process")
    else:
        # Other sharegpt-style datasets: rename the conversation column and drop the rest.
        if "items" in ds.column_names:
            ds = ds.rename_column("items", "conversations")
        columns_to_remove = ds.column_names.copy()
        columns_to_remove.remove('conversations')
        ds = ds.remove_columns(columns_to_remove)

    ds.to_json(args.out_file, lines=True, force_ascii=False)
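
For reference, a converted record in sharegpt format looks like the sketch below; the alpaca record is invented for illustration:

```python
# Hypothetical input record (alpaca format):
record = {"instruction": "纠错", "input": "今天新情很好", "output": "今天心情很好"}

# Since `input` is non-empty, process_alpaca emits this conversation:
converted = {"conversations": [
    {"from": "human", "value": "对这个句子语法纠错\n\n今天新情很好"},
    {"from": "gpt", "value": "今天心情很好"},
]}
```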