Merge pull request #431 from shibing624/llm
Showing 196 changed files with 35,939 additions and 283,559 deletions.
@@ -0,0 +1,20 @@

# Solution

### Rule-based approach

Detect likely typo positions with a language model, then correct them using pinyin (sound-similarity) features, stroke/Wubi edit-distance (shape-similarity) features, and language-model perplexity.
1. Chinese error correction proceeds in two steps: first error detection, then error correction.
2. For detection, the sentence is first segmented with the jieba Chinese tokenizer. Because the sentence contains typos, the segmentation often splits incorrectly, so errors are detected at both character and word granularity; the suspected errors from the two granularities are merged into a candidate set of suspected error positions.
3. For correction, every suspected error position is traversed, the word at each position is replaced with entries from sound-similar and shape-similar dictionaries, sentence perplexity is computed with a language model, and all candidates are compared and ranked to select the best correction (a minimal sketch follows this list).
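The perplexity-ranking step in item 3 can be made concrete with a short sketch. This is a minimal illustration, not pycorrector's actual code; the kenlm model file name and the character-level tokenization are assumptions.

```python
import kenlm

# Assumed character-level n-gram LM file; pycorrector uses a similar kenlm
# model, but this exact path is an assumption.
lm = kenlm.Model("zh_giga.no_cna_cmn.prune01244.klm")

def best_correction(sentence, start, end, candidates):
    """Try each candidate at sentence[start:end]; keep the lowest-perplexity result."""
    scored = []
    for cand in [sentence[start:end]] + candidates:  # the original span competes too
        repaired = sentence[:start] + cand + sentence[end:]
        # kenlm scores space-separated tokens; split into characters here.
        scored.append((lm.perplexity(" ".join(repaired)), repaired))
    return min(scored)[1]

# e.g. best_correction("少先队员因该为老人让座", 4, 6, ["应该", "因为"])
# returns the sentence with "应该" if the LM assigns it lower perplexity.
```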
### Deep-model approach
1. End-to-end deep models avoid hand-crafted feature engineering and reduce manual effort. RNN sequence models fit text tasks well; an RNN-with-attention model took first place in an English text-correction competition, demonstrating strong practical results.
2. A CRF computes the conditional probability of the globally optimal output sequence, so it detects a specific error type by judging the error against the whole sentence. Alibaba entered the 2016 Chinese grammatical-error-correction task with this approach and took first place, again demonstrating strong practical results.
3. Seq2Seq models use an encoder-decoder architecture to solve sequence-transduction problems; they are currently among the most widely used and best-performing models for such tasks (machine translation, dialogue generation, text summarization, image captioning).
4. Pretrained models such as BERT/ELECTRA/ERNIE/MacBERT have powerful language-representation capacity and have brought sweeping change to NLP; language models fitted on massive training data are hard to beat. Leveraging their MASK-based pretraining, a pretrained model can be adapted for correction with simple modifications, and with fine-tuning the results readily reach the state of the art (a minimal sketch follows this list).
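Item 4 can be sketched with an off-the-shelf masked LM. This is a minimal illustration, not the project's implementation; the `hfl/chinese-macbert-base` checkpoint and the one-character-at-a-time masking are assumptions.

```python
from transformers import pipeline

# Assumed checkpoint; any Chinese masked LM works for this illustration.
fill_mask = pipeline("fill-mask", model="hfl/chinese-macbert-base")

sentence = "少先队员因该为老人让座"
suspect = 4  # suppose detection flagged the character "因"
masked = sentence[:suspect] + fill_mask.tokenizer.mask_token + sentence[suspect + 1:]

# Print the model's top replacements for the masked position.
for pred in fill_mask(masked, top_k=3):
    print(pred["token_str"], round(pred["score"], 4))
```

A model fine-tuned for correction (as pycorrector does with MacBERT-based correctors) predicts a replacement for every position in a single pass instead of masking characters one at a time.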
PS:

- [The author's talk on pycorrector](https://github.com/shibing624/pycorrector/wiki/pycorrector%E6%BA%90%E7%A0%81%E8%A7%A3%E8%AF%BB-%E7%9B%B4%E6%92%AD%E5%88%86%E4%BA%AB)
- [A community source-code walkthrough](https://zhuanlan.zhihu.com/p/138981644)
@@ -0,0 +1,49 @@

```python
"""
Convert an alpaca-format dataset into sharegpt format.
Usage: python convert_alpaca.py --in_file alpaca_data.json --out_file alpaca_data_sharegpt.json
"""

import argparse

from datasets import load_dataset

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--in_file", type=str, required=True)
    parser.add_argument("--out_file", type=str, required=True)
    parser.add_argument("--data_type", type=str, default='alpaca')
    args = parser.parse_args()
    print(args)
    data_files = {"train": args.in_file}
    raw_datasets = load_dataset('json', data_files=data_files)
    ds = raw_datasets['train']

    system_prompt = "对这个句子语法纠错"

    def process_alpaca(examples):
        """Map alpaca (instruction, input, output) triples to sharegpt conversations."""
        convs = []
        for instruction, inp, output in zip(examples['instruction'], examples['input'], examples['output']):
            if len(inp.strip()) > 1:
                # Non-empty input: prepend the fixed correction prompt to the sentence.
                instruction = system_prompt + '\n\n' + inp
            q = instruction
            a = output
            convs.append([
                {"from": "human", "value": q},
                {"from": "gpt", "value": a}
            ])
        return {"conversations": convs}

    if args.data_type in ['alpaca']:
        ds = ds.map(process_alpaca, batched=True, remove_columns=ds.column_names, desc="Running process")
    else:
        # Other sharegpt datasets: rename to `conversations` and drop unused columns.
        if "items" in ds.column_names:
            ds = ds.rename_column("items", "conversations")
        columns_to_remove = ds.column_names.copy()
        columns_to_remove.remove('conversations')
        ds = ds.remove_columns(columns_to_remove)

    ds.to_json(f"{args.out_file}", lines=True, force_ascii=False)
```