Description
I've been studying your work on DiffuScene and find the approach very insightful. While reviewing the source code, I noticed a specific architectural choice for the Unet1D denoiser that I was curious about.
In the provided configuration files, the network consistently uses uniform dimension multipliers (dim_mults=[1, 1, 1, 1]) paired with a high base dimension (dim=512), so every level has the same width. This departs from the more common pyramidal U-Net structure (e.g., dim_mults=[1, 2, 4, 8]), which creates an information bottleneck at its deepest level.
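To make the comparison concrete, here is a minimal sketch of how I understand the multipliers to translate into per-level channel widths. It assumes the common convention that each level's width is dim * multiplier. The helper names (stage_widths, in_out_pairs) and the base dimension of 64 for the pyramidal case are hypothetical and for illustration only; none of this is taken from the DiffuScene code.

```python
def stage_widths(dim, dim_mults):
    """Channel width of each U-Net level, assuming width = dim * multiplier."""
    return [dim * m for m in dim_mults]


def in_out_pairs(dim, dim_mults, init_dim=None):
    """(in_channels, out_channels) per level, assuming the first level
    receives init_dim channels (defaults to dim)."""
    init_dim = dim if init_dim is None else init_dim
    dims = [init_dim, *stage_widths(dim, dim_mults)]
    return list(zip(dims[:-1], dims[1:]))


# Fixed-width setting from the configuration files: dim=512, dim_mults=[1, 1, 1, 1]
fixed = stage_widths(512, [1, 1, 1, 1])

# Hypothetical pyramidal setting for comparison (base dim of 64 is an assumption)
pyramidal = stage_widths(64, [1, 2, 4, 8])

print(fixed)      # [512, 512, 512, 512]
print(pyramidal)  # [64, 128, 256, 512]
print(in_out_pairs(512, [1, 1, 1, 1]))
```

Under this reading, the fixed-width setting keeps 512 channels at every level, while a pyramidal setting only reaches that width at the deepest level.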
Could you elaborate on the design rationale for this fixed-width architecture? My initial guess is that it is meant to preserve the high-dimensional information of the object attribute set at every level of the network, which could be particularly beneficial for the attention mechanisms that model inter-object relationships.
Did you experiment with a more conventional pyramidal U-Net, and if so, how did its performance compare?