@scyyh11 scyyh11 commented Nov 6, 2025

PR description

While testing the Jarvis-DFT3D family of datasets, every config file ran to completion except
megnet_jarvis_dft_3d_2021_bulk_modulus.yaml, which fails at run time with the following error:

Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/aistudio/external-libraries/lib/python3.10/site-packages/paddle/io/dataloader/dataloader_iter.py", line 249, in _thread_loop
    batch = self._dataset_fetcher.fetch(
  File "/home/aistudio/external-libraries/lib/python3.10/site-packages/paddle/io/dataloader/fetcher.py", line 77, in fetch
    data.append(self.dataset[idx])
  File "/home/aistudio/external-libraries/lib/python3.10/site-packages/paddle/io/dataloader/dataset.py", line 522, in __getitem__
    return self.dataset[self.indices[idx]]
  File "/home/aistudio/PaddleMaterials/ppmat/datasets/jarvis_dataset.py", line 984, in __getitem__
    ).astype("float32")
ValueError: could not convert string to float: 'na'

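Investigation showed that the error comes from invalid strings (such as 'na') in the dataset, which JarvisDataset's handling of invalid or missing data did not cover.
This PR strengthens data filtering and adds runtime validation, so that invalid entries are removed during initialization and any that remain produce a clearer error message during training.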
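
The failure can be reproduced in isolation. The sketch below is not code from the repository: it assumes the property values are converted with NumPy's astype("float32"), as the last traceback frame suggests, and to_float32 is a hypothetical helper name used only for illustration.

```python
import numpy as np


def to_float32(values):
    """Hypothetical stand-in for the conversion in the last traceback frame."""
    return np.array(values).astype("float32")


# Numeric strings convert without complaint.
print(to_float32(["12.5", "98.1"]))

# A single missing-value token is enough to make the whole conversion fail.
try:
    to_float32(["12.5", "na"])
except ValueError as err:
    print(f"ValueError: {err}")
```

The exact wording of the message can differ between NumPy versions, but the exception type is ValueError in each case.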


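1. Stronger property filtering (filter_unvalid_by_property)

Before:

  • Invalid string values (such as 'na' and 'none') were not recognized and were wrongly kept.
  • Non-numeric strings were not skipped, so training could hit a type-conversion exception.

After:

  • Common missing-value strings ('na', 'nan', 'none' and the empty string '', matched case-insensitively) are handled explicitly.
  • They are converted to np.nan and removed during filtering.
  • Non-numeric strings are skipped outright, so they are never counted as valid samples.

The code change: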

# Convert 'na' strings to NaN for proper filtering
if isinstance(data_item, str):
    if data_item.lower() in ['na', 'nan', 'none', '']:
        data_item = np.nan
    else:
        # Skip non-numeric strings (they're invalid for numeric properties)
        continue
# Keep only valid numeric values (not None, not NaN)
if data_item is not None and not math.isnan(data_item):
    reserve_idx.append(i)

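With this change, invalid strings and missing values are removed while the data is loaded, which makes filtering more robust and consistent.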
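
To see which entries survive, the loop body shown above can be wrapped in a standalone function. This is a minimal sketch rather than the repository's method: valid_indices, MISSING_TOKENS and the sample list are hypothetical names and data, and math.nan stands in for np.nan so that the block needs only the standard library.

```python
import math

# Same tokens as the PR snippet; the name is hypothetical.
MISSING_TOKENS = ("na", "nan", "none", "")


def valid_indices(values):
    """Return the indices of entries that pass the filter from the PR snippet."""
    reserve_idx = []
    for i, data_item in enumerate(values):
        if isinstance(data_item, str):
            if data_item.lower() in MISSING_TOKENS:
                data_item = math.nan  # treated like any other NaN below
            else:
                continue  # every other string is skipped
        if data_item is not None and not math.isnan(data_item):
            reserve_idx.append(i)
    return reserve_idx


# Made-up sample values, for illustration only.
samples = [12.5, "na", None, float("nan"), "NaN", "", 98.1, "unknown", "42.0"]
print(valid_indices(samples))  # → [0, 6]
```

Note that, as the snippet is written, any string outside the token list is skipped, so a numeric string such as "42.0" is dropped as well rather than parsed.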


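2. New runtime sample validation (__getitem__)

Before:

  • Samples were not validated on access, so some invalid values raised exceptions only during training or inference.
  • The error message lacked context, which made the problem hard to locate.

After:

  • __getitem__ now checks at run time for missing-value strings such as 'na' and for NaN values.
  • When it finds one, it immediately raises a ValueError that names the index and the property; for missing-value strings the message also suggests a fix.
  • This makes it quick to locate the offending data and resolve the problem.

The code change: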

value = self.property_data[property_name][idx]
# Check for 'na' strings - these should have been filtered out during initialization
if isinstance(value, str) and value.lower() in ['na', 'nan', 'none', '']:
    raise ValueError(
        f"Found invalid property value '{value}' at index {idx} for property "
        f"'{property_name}'. This should have been filtered out during dataset "
        f"initialization. Please ensure 'filter_unvalid=True' is set and "
        f"consider clearing the cache to regenerate filtered data."
    )
# Check for NaN values - these should also have been filtered out
if value is not None and (isinstance(value, float) and math.isnan(value)):
    raise ValueError(
        f"Found NaN value at index {idx} for property '{property_name}'. "
        f"This should have been filtered out during dataset initialization."
    )

This check catches bad values at access time and reports a clear, traceable error, which makes debugging faster.
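
The same checks can be exercised outside the dataset class. The sketch below is an illustration under stated assumptions, not the merged implementation: check_property_value and MISSING_TOKENS are hypothetical names, the messages are shortened, and "bulk_modulus" is only an example property name.

```python
import math

# Same tokens as the PR snippet; the name is hypothetical.
MISSING_TOKENS = ("na", "nan", "none", "")


def check_property_value(value, idx, property_name):
    """Return value unchanged, or raise ValueError for a missing-value token or NaN."""
    if isinstance(value, str) and value.lower() in MISSING_TOKENS:
        raise ValueError(
            f"Found invalid property value '{value}' at index {idx} "
            f"for property '{property_name}'."
        )
    if isinstance(value, float) and math.isnan(value):
        raise ValueError(
            f"Found NaN value at index {idx} for property '{property_name}'."
        )
    return value


print(check_property_value(12.5, 0, "bulk_modulus"))

try:
    check_property_value("na", 3, "bulk_modulus")
except ValueError as err:
    print(err)
```

One design point worth keeping in mind: isinstance(value, float) matches Python floats and np.float64, but not np.float32, so a NaN stored as float32 would pass this check unnoticed.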


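Verification

  • Config tested: megnet_jarvis_dft_3d_2021_bulk_modulus.yaml
  • After the fix: data loading, training and inference all run without errors.
  • Other Jarvis-DFT3D datasets (e.g. formation_energy, band_gap) were re-verified and are unaffected.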

@leeleolay @luotao1

paddle-bot bot commented Nov 6, 2025

Thanks for your contribution!

@leeleolay leeleolay left a comment

LGTM

@leeleolay leeleolay merged commit e73ec81 into PaddlePaddle:develop Nov 12, 2025
1 check passed
Wei-jie-Wu pushed a commit to Wei-jie-Wu/PaddleMaterials that referenced this pull request Nov 13, 2025
Labels

contributor External developers
