New promising architecture: Hourglass Diffusion Transformers
This is an automated archive made by the Lemmit Bot.
The original was posted on /r/stablediffusion by /u/jslominski on 2024-01-23 08:32:15.
From the paper: "Instead of treating images the same regardless of resolution, this architecture adapts to the target resolution, processing local phenomena locally at high resolutions and separately processing global phenomena in low-resolution parts of the hierarchy. This yields an architecture whose computational complexity scales with O(n) when used at higher resolutions instead of O(n²), bridging the gap between the excellent scaling properties of transformer models and the efficiency of U-Nets. We demonstrate that this architecture enables megapixel-scale pixel-space diffusion models without requiring tricks such as self-conditioning or multiresolution architectures and that it is competitive with other transformer diffusion backbones even at small resolutions, both in fairly matched pixel-space settings, where it is substantially more efficient, and when compared to transformers in latent diffusion setups."
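To make the quoted idea concrete, here is a rough, self-contained PyTorch sketch of the hourglass principle: windowed (local) attention at the high-resolution level, full (global) attention only at a downsampled bottleneck, and a skip connection merging the two on the way back up. This is not the authors' implementation; every name here (HourglassSketch, LocalBlock, GlobalBlock, the window size, the 2x2 merge for downsampling) is made up for illustration, and the real architecture has more levels and diffusion-specific conditioning.

```python
# Illustrative sketch only, not the HDiT code: local windowed attention at high
# resolution, global attention at the low-resolution bottleneck, skip across.
import torch
import torch.nn as nn


def window_partition(x, ws):
    """Split (B, H, W, C) feature maps into non-overlapping ws x ws windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)


def window_merge(windows, ws, H, W):
    """Inverse of window_partition."""
    B = windows.shape[0] // ((H // ws) * (W // ws))
    x = windows.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)


class LocalBlock(nn.Module):
    """Transformer block with attention restricted to fixed-size local windows."""
    def __init__(self, dim, heads=4, ws=8):
        super().__init__()
        self.ws = ws
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                      # x: (B, H, W, C)
        B, H, W, C = x.shape
        w = window_partition(self.norm(x), self.ws)
        a, _ = self.attn(w, w, w)
        x = x + window_merge(a, self.ws, H, W)
        return x + self.mlp(x)


class GlobalBlock(nn.Module):
    """Standard transformer block with full self-attention, used only at low resolution."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                      # x: (B, H, W, C)
        B, H, W, C = x.shape
        t = self.norm(x).reshape(B, H * W, C)
        a, _ = self.attn(t, t, t)
        x = x + a.reshape(B, H, W, C)
        return x + self.mlp(x)


class HourglassSketch(nn.Module):
    """Local block -> downsample -> global bottleneck -> upsample -> merge skip -> local block."""
    def __init__(self, dim=64):
        super().__init__()
        self.local_in = LocalBlock(dim)
        self.down = nn.Linear(4 * dim, dim)    # 2x2 neighborhood merge, then project
        self.bottleneck = GlobalBlock(dim)
        self.up = nn.Linear(dim, 4 * dim)      # inverse expansion back to full resolution
        self.skip = nn.Linear(2 * dim, dim)    # merge skip + upsampled features
        self.local_out = LocalBlock(dim)

    def forward(self, x):                      # x: (B, H, W, C), H and W divisible by 16
        x = self.local_in(x)
        skip = x
        B, H, W, C = x.shape
        # 2x2 merge halves the spatial resolution before global attention
        d = x.reshape(B, H // 2, 2, W // 2, 2, C).permute(0, 1, 3, 2, 4, 5)
        d = self.down(d.reshape(B, H // 2, W // 2, 4 * C))
        d = self.bottleneck(d)
        u = self.up(d).reshape(B, H // 2, W // 2, 2, 2, C).permute(0, 1, 3, 2, 4, 5)
        x = self.skip(torch.cat([skip, u.reshape(B, H, W, C)], dim=-1))
        return self.local_out(x)


if __name__ == "__main__":
    x = torch.randn(1, 32, 32, 64)             # (B, H, W, C) feature map
    print(HourglassSketch(64)(x).shape)        # torch.Size([1, 32, 32, 64])
```

The point of the structure: attention at full resolution only ever looks inside a fixed-size window, so its cost grows linearly with pixel count, while the quadratic global attention is confined to the small low-resolution level. That is the intuition behind the O(n) vs. O(n²) claim in the quote.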