
Efficiently loading Waymo raw data? #856

Open
aryasenna opened this issue Jul 5, 2024 · 4 comments


aryasenna commented Jul 5, 2024

Hello,

I'm using the v2.0.0 dataset and have successfully followed the example on loading Waymo data with Dask.

This is fine for quick testing, but when I use the same method in my data loader, it does not scale well. Dask is nice, but when I actually call Dask's compute() to fetch the data, it takes some time even with a fast disk.

When shuffled, the data loader samples frames randomly, so I can't eagerly load each parquet file by relying on its order.

Example: when the loader happens to sample 10 different frames from 10 different parquet files, it becomes an I/O bottleneck, even with multiple workers.

Preloading the whole dataset is out of the question due to memory constraints.

I have been looking at how other frameworks (e.g. MMDetection) and third-party libraries (e.g. the PyTorch Waymo loader) use Waymo: they pre-convert the training frames (e.g. to pickle) so that access is fast even when frames are randomly sampled.
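For concreteness, here is a minimal sketch of what such a pre-conversion could look like, assuming the v2 schema's key.frame_timestamp_micros column; the preconvert helper and its paths are hypothetical:

import os
import pickle

import dask.dataframe as dd

# Hypothetical sketch: split one component parquet file into per-frame
# pickle files so a random access during training is a single small read.
def preconvert(parquet_path, out_dir):
    df = dd.read_parquet(parquet_path).compute()  # one full pass per file
    for timestamp, frame_rows in df.groupby('key.frame_timestamp_micros'):
        with open(os.path.join(out_dir, f'{timestamp}.pkl'), 'wb') as f:
            pickle.dump(frame_rows, f)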

Is this the recommended way? I had the feeling that the use of parquet files + Dask was meant to address this exact issue.

Thanks in advance for the insight.


aryasenna commented Jul 13, 2024

In case anyone is wondering:

One possible solution, depending on your use case, is to use "pushdown filtering".

Too bad the Waymo v2 example/tutorial never mentions the use of Dask's filtering.

The filtering should be done when you first read your parquet file:

e.g.

import os

import dask.dataframe as dd

# `directories`, `context_name`, `timestamp`, and `CameraName` come from the
# surrounding loader code.
image_df = dd.read_parquet(
    os.path.join(directories['CameraImage'], context_name + '.parquet'),
    columns=['key.frame_timestamp_micros', 'key.camera_name', '[CameraImageComponent].image'],
    filters=[('key.frame_timestamp_micros', '==', timestamp), ('key.camera_name', '==', CameraName.FRONT.value)]
)

This approach works for me because my training loader only expects one timestamp and a specific camera. The idea is to make sure you only load the part of the parquet file you actually need.
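For illustration, here is a minimal sketch of how this filtered read could sit inside a PyTorch Dataset; WaymoFrontCameraDataset, camera_image_dir, and the precomputed index of (context_name, timestamp) pairs are hypothetical names, not part of the Waymo API:

import os

import dask.dataframe as dd
from torch.utils.data import Dataset

class WaymoFrontCameraDataset(Dataset):
    def __init__(self, camera_image_dir, index, camera_name):
        self.camera_image_dir = camera_image_dir
        self.index = index  # list of (context_name, timestamp) pairs
        self.camera_name = camera_name

    def __len__(self):
        return len(self.index)

    def __getitem__(self, i):
        context_name, timestamp = self.index[i]
        # The pushdown filter means only the matching row groups are read.
        df = dd.read_parquet(
            os.path.join(self.camera_image_dir, context_name + '.parquet'),
            columns=['[CameraImageComponent].image'],
            filters=[('key.frame_timestamp_micros', '==', timestamp),
                     ('key.camera_name', '==', self.camera_name)],
        ).compute()
        return df['[CameraImageComponent].image'].iloc[0]  # encoded image bytes

With shuffling, each __getitem__ still only touches the row groups that survive the filter, instead of decoding the whole file.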

I will leave the issue open for visibility in case the Waymo team wants to update their documentation.

JingweiJ (Collaborator) commented

Yes, pushdown filtering is a good way to improve efficiency. We mention it briefly in the "A relational database-like structure" section of the example/tutorial, but we should discuss the efficiency aspect in more depth. Thanks for the advice!


aryasenna commented Jul 15, 2024

@JingweiJ Thanks for checking this issue. Yes, you're correct, pushdown filtering does appear there in a short comment. So technically, it was "mentioned".

My point is that in the actual example code, which only uses a single frame, it would make sense to use pushdown filtering by default. 🙂


nlgranger commented Oct 23, 2024

In case it helps anyone, I have written a library that can load any given data sample from Waymo (and also KITTI, nuScenes, or ZOD). As we discussed in #841, one needs to re-encode the parquet files to make random access fast.
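For reference, a minimal sketch of that re-encoding idea using pyarrow; reencode_for_random_access is a hypothetical helper, not necessarily what tri3d does internally, and the tiny row groups trade compression ratio for fast single-frame reads:

import pyarrow.parquet as pq

def reencode_for_random_access(src, dst, rows_per_group=1):
    # Rewrite the file with very small row groups so a filtered read of a
    # single frame only has to decode a few rows.
    table = pq.read_table(src)
    pq.write_table(table, dst, row_group_size=rows_per_group)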

The library is here: https://github.com/CEA-LIST/tri3d

It is a bit opinionated because I needed to settle on common conventions across datasets, but I think you'll find it does what you expect most of the time. Notably, it has sane defaults for interpolating poses (ego car, boxes, sensors), so that when you request something at, say, LiDAR frame 12, it will actually overlap well with the point cloud.
