Question about fixed dim mults for U-Net

I've been studying your work on DiffuScene and find the approach very insightful. While reviewing the source code, I noticed a specific architectural choice for the Unet1D denoiser that I was curious about.

In the provided configuration files, the network consistently uses fixed dimension multipliers (dim_mults=[1, 1, 1, 1]) paired with a high base dimension (dim=512). This is a departure from the more common pyramidal U-Net structure (e.g., dim_mults=[1, 2, 4, 8]), in which the channel width grows at each deeper level as the resolution shrinks toward the bottleneck.
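To make the comparison concrete, here is a minimal sketch of how I understand dim_mults to translate into per-level channel widths. It assumes the convention used in common open-source diffusion U-Nets, where each level's output width is dim multiplied by the corresponding entry; I have not confirmed that Unet1D in this repository computes it exactly this way. The helper name level_widths and the base width of 64 for the pyramidal case are my own illustrative choices, not values taken from the configs.

```python
def level_widths(dim, dim_mults, init_dim=None):
    """Return (in_channels, out_channels) for each resolution level.

    Assumed convention (not verified against the DiffuScene code):
    widths are [init_dim, dim * m_1, dim * m_2, ...], and consecutive
    entries are paired up to give each level's input and output width.
    """
    init_dim = dim if init_dim is None else init_dim
    dims = [init_dim] + [dim * m for m in dim_mults]
    return list(zip(dims[:-1], dims[1:]))


# Fixed-width setting from the configs: every level stays at 512 channels.
fixed = level_widths(512, [1, 1, 1, 1])

# Hypothetical pyramidal alternative with a base width of 64 (my assumption).
pyramidal = level_widths(64, [1, 2, 4, 8])

print(fixed)      # [(512, 512), (512, 512), (512, 512), (512, 512)]
print(pyramidal)  # [(64, 64), (64, 128), (128, 256), (256, 512)]
```

Under this assumption, the fixed-width variant carries 512 channels at every level, whereas the pyramidal variant only reaches that width at its deepest level.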

I was wondering if you could elaborate on the design rationale for this fixed-width architecture. My initial thought was that this might be to better preserve the high-dimensional information of the object attribute set at every level of the network, which could be particularly beneficial for the attention mechanisms modeling inter-object relationships.

Did you experiment with a more conventional pyramidal U-Net, and if so, how did its performance compare?

Question about fixed dim mults for U-Net #62
