I followed the README to reproduce ERNIE-Layout fine-tuning. Since the machine is in an offline environment, the FUNSD and xfund_zh datasets were downloaded via wget.
Three problems currently occur on the FUNSD and xfund_zh datasets:
(1) Running the original code with the documented command line raises:
Traceback (most recent call last):
  File "run_ner.py", line 235, in <module>
    main(filename)
  File "run_ner.py", line 75, in main
    label_list, label_to_id = get_label_ld(train_ds["qas"], scheme=data_args.pattern.split("-")[1])
  File "/home/.../data/model/ERNIE-layout/utils.py", line 135, in get_label_ld
    for key in qa["question"]:
TypeError: list indices must be integers or slices, not str
My guess is that this is a dataset-format problem, so I adjusted the `get_label_ld` code in utils.py (code shown in the section below), which works around it for now.
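Since my patch only appears as a screenshot under "Fix 1" below, here is a hypothetical sketch of the kind of change involved (the function name and call signature come from the traceback; the BIO expansion and return values are my assumptions about what `get_label_ld` produces). The core point: in the downloaded data, each row's `qas` value is a *list* of `{"question": ..., "answers": ...}` dicts, so the string-keyed access must iterate the list.

```python
def get_label_ld(qas_column, scheme="bio"):
    """Collect the label set from the dataset's `qas` column.

    In the downloaded FUNSD/xfund_zh files each row's `qas` value is a
    *list* of {"question": ..., "answers": ...} dicts, so the original
    string-keyed access (qa["question"]) raised:
        TypeError: list indices must be integers or slices, not str
    Iterating the list of QA dicts avoids that.
    """
    labels = set()
    for qas in qas_column:      # one row's QA annotations (a list)
        for qa in qas:          # each QA pair is a dict
            labels.add(qa["question"])
    label_list = sorted(labels)
    if scheme.lower() == "bio":
        # expand every entity label into B-/I- tags, with "O" first
        label_list = ["O"] + [f"{p}-{lab}" for lab in label_list for p in ("B", "I")]
    label_to_id = {label: i for i, label in enumerate(label_list)}
    return label_list, label_to_id
```

This is a sketch of the workaround, not the repo's actual implementation; the real fix may need to match whatever label scheme `data_args.pattern` encodes.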
(2) After applying the fix from (1), a new error appears:
Traceback (most recent call last):
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 1347, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3474, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3353, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/home/.../data/model/ERNIE-layout/utils.py", line 270, in preprocess_ner
    packed_QA = zip(qas["question"], qas["answers"])
TypeError: list indices must be integers or slices, not str
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run_ner.py", line 235, in <module>
    main(filename)
  File "run_ner.py", line 121, in main
    train_dataset = train_ds.map(
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3189, in map
    for rank, done, content in iflatmap_unordered(
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 1387, in iflatmap_unordered
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 1387, in <listcomp>
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/multiprocess/pool.py", line 771, in get
    raise self._value
TypeError: list indices must be integers or slices, not str
Again I suspect a data-format problem, so I modified the `preprocess_ner` code in utils.py (shown in the section below), which works around it for now.
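The change in `preprocess_ner` is analogous (again a sketch under the same assumption: inside a batched `map`, `examples["qas"]` is a list with one QA list per example, so `zip(qas["question"], qas["answers"])` indexes a list with a string key). A small helper that flattens the batch into `(question, answers)` pairs illustrates it; `iter_qa_pairs` is a name I made up for this sketch:

```python
def iter_qa_pairs(qas_batch):
    """Yield (question, answers) for every QA dict in a batch.

    `qas_batch` comes from examples["qas"] inside a batched map call:
    one entry per example, each entry a list of QA dicts. The original
    zip(qas["question"], qas["answers"]) assumed a dict of parallel
    lists and raised TypeError on this layout.
    """
    for qas in qas_batch:       # per-example QA list
        for qa in qas:          # each QA pair is a dict
            yield qa["question"], qa["answers"]
```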
(3) After fixing the first two problems, running again raises:
Traceback (most recent call last):
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 1347, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3474, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3353, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/home/.../data/model/ERNIE-layout/utils.py", line 403, in preprocess_ner
    feature_id = examples["name"][example_idx] + "__" + str(examples["page_no"][example_idx])
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 270, in __getitem__
    value = self.data[key]
KeyError: 'page_no'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run_ner.py", line 235, in <module>
    main(filename)
  File "run_ner.py", line 121, in main
    train_dataset = train_ds.map(
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3189, in map
    for rank, done, content in iflatmap_unordered(
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 1387, in iflatmap_unordered
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 1387, in <listcomp>
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/home/.../miniconda3/envs/ernie-dev/lib/python3.8/site-packages/multiprocess/pool.py", line 771, in get
    raise self._value
KeyError: 'page_no'
Checking the FUNSD and xfund_zh datasets, I found that no sample contains a 'page_no' field, so I suspect the downloaded datasets are not the ones actually usable for fine-tuning.
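As a stopgap one could default the missing field before mapping (purely my assumption that every sample is a single page; `add_default_page_no` is a name invented for this sketch, not part of the repo):

```python
def add_default_page_no(example):
    """Fill in 'page_no' for datasets that lack it.

    Every sample in the downloaded FUNSD/xfund_zh archives is missing
    'page_no'; assuming single-page documents, default it to 0.
    Intended for use as: train_ds = train_ds.map(add_default_page_no)
    """
    example.setdefault("page_no", 0)
    return example
```

This only papers over the missing field: if the intended dataset really has multi-page documents, feature ids built as `name + "__" + page_no` would collide, so the proper fix is still the correct dataset or its preprocessing script.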
Could you please check whether the datasets are correct (download URLs: https://bj.bcebos.com/paddlenlp/datasets/funsd.tar.gz and https://bj.bcebos.com/paddlenlp/datasets/xfund_zh.tar.gz), or provide the data-preprocessing code? Thanks!
Steps to reproduce & code
Screenshot of error 1:
Fix 1:
Screenshot of error 2:
Fix 2:
Screenshot of error 3:
Mercurialzs changed the title from "[Bug]: Every sample lacks page_no when reproducing ERNIE-Layout fine-tuning on the FUNSD and xfund_zh datasets" to "[Bug]: Missing fields in the FUNSD and xfund_zh datasets provided with the source code cause multiple errors when reproducing ERNIE-Layout fine-tuning" on Aug 24, 2023.