
Efficiently loading Waymo raw data? #856

Open
aryasenna opened this issue Jul 5, 2024 · 4 comments


aryasenna commented Jul 5, 2024

Hello,

I'm using the v2.0.0 dataset and have successfully followed the example on loading Waymo data with Dask.

This is fine for quick testing, but when I use the same method in my data loader, it does not scale well. Dask is nice, but when I actually call Dask's compute() to fetch the data, it takes some time even with a fast disk.

When shuffled, the data loader samples frames randomly, so I can't eagerly load each parquet file by relying on its order.

Example: when the loader happens to sample 10 different frames from 10 different parquet files, it becomes an I/O bottleneck, even with multiple workers.

Preloading the whole dataset is out of the question due to memory constraints.

I have been looking at how other frameworks (e.g. MMDetection) and third-party libraries (e.g. the PyTorch Waymo loader) use Waymo: they pre-convert the training frames (e.g. to pickle) so that access is fast even when frames are randomly sampled.
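For concreteness, here is a minimal sketch of what such a pre-conversion could look like, assuming the v2 schema's key.frame_timestamp_micros column; the preconvert helper and its paths are hypothetical:

import os
import pickle

import dask.dataframe as dd

# Hypothetical sketch: split one component parquet file into per-frame
# pickle files so a random access during training is a single small read.
def preconvert(parquet_path, out_dir):
    df = dd.read_parquet(parquet_path).compute()  # one full pass per file
    for timestamp, frame_rows in df.groupby('key.frame_timestamp_micros'):
        with open(os.path.join(out_dir, f'{timestamp}.pkl'), 'wb') as f:
            pickle.dump(frame_rows, f)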

Is this the recommended way? I had the feeling that the use of parquet files + Dask was meant to address this exact issue.

Thanks in advance for the insight.


aryasenna commented Jul 13, 2024

In case anyone is wondering:

One possible solution, depending on your use case, is to use "pushdown filtering".

Too bad the Waymo v2 example/tutorial never mentions the use of Dask's filtering.

The filtering should be done when you first read your parquet file:

e.g.

import os

import dask.dataframe as dd

# `directories`, `context_name`, `timestamp`, and `CameraName` come from the
# surrounding loader code.
image_df = dd.read_parquet(
    os.path.join(directories['CameraImage'], context_name + '.parquet'),
    columns=['key.frame_timestamp_micros', 'key.camera_name', '[CameraImageComponent].image'],
    filters=[('key.frame_timestamp_micros', '==', timestamp), ('key.camera_name', '==', CameraName.FRONT.value)]
)

This approach works for me because my training loader only expects one timestamp and a specific camera. The idea is to make sure you only load the part of the parquet file you actually need.
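For illustration, here is a minimal sketch of how this filtered read could sit inside a PyTorch Dataset; WaymoFrontCameraDataset, camera_image_dir, and the precomputed index of (context_name, timestamp) pairs are hypothetical names, not part of the Waymo API:

import os

import dask.dataframe as dd
from torch.utils.data import Dataset

class WaymoFrontCameraDataset(Dataset):
    def __init__(self, camera_image_dir, index, camera_name):
        self.camera_image_dir = camera_image_dir
        self.index = index  # list of (context_name, timestamp) pairs
        self.camera_name = camera_name

    def __len__(self):
        return len(self.index)

    def __getitem__(self, i):
        context_name, timestamp = self.index[i]
        # The pushdown filter means only the matching row groups are read.
        df = dd.read_parquet(
            os.path.join(self.camera_image_dir, context_name + '.parquet'),
            columns=['[CameraImageComponent].image'],
            filters=[('key.frame_timestamp_micros', '==', timestamp),
                     ('key.camera_name', '==', self.camera_name)],
        ).compute()
        return df['[CameraImageComponent].image'].iloc[0]  # encoded image bytes

With shuffling, each __getitem__ still only touches the row groups that survive the filter, instead of decoding the whole file.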

I will leave the issue open for visibility in case the Waymo team wants to update their documentation.

JingweiJ (Collaborator) commented

Yes, pushdown filtering is a good way to improve efficiency. We mention it briefly in the "A relational database-like structure" section of the example/tutorial, but we should discuss the efficiency aspect in more depth. Thanks for the advice!


aryasenna commented Jul 15, 2024

@JingweiJ Thanks for checking this issue. Yes, you're correct, pushdown filtering does appear there in a short comment. So technically, it was "mentioned".

My point is that in the actual example code, which only uses a single frame, it would make sense to use pushdown filtering by default. 🙂


nlgranger commented Oct 23, 2024

In case it helps anyone, I have written a library that can load any given data sample from Waymo (and also KITTI, nuScenes, or ZOD). As we discussed in #841, one needs to re-encode the parquet files to make random access fast.
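For reference, a minimal sketch of that re-encoding idea using pyarrow; reencode_for_random_access is a hypothetical helper, not necessarily what tri3d does internally, and the tiny row groups trade compression ratio for fast single-frame reads:

import pyarrow.parquet as pq

def reencode_for_random_access(src, dst, rows_per_group=1):
    # Rewrite the file with very small row groups so a filtered read of a
    # single frame only has to decode a few rows.
    table = pq.read_table(src)
    pq.write_table(table, dst, row_group_size=rows_per_group)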

The library is here: https://github.com/CEA-LIST/tri3d

It is a bit opinionated because I needed to settle on common conventions across datasets, but I think you'll find it does what you expect most of the time. Notably, it has sane defaults for interpolating poses (ego car, boxes, sensors), so that when you request something at, say, LiDAR frame 12, it will actually overlap well with the point cloud.
