MLNews

Image Generation: High-Resolution Images with Diffusion Models.

Dive into a universe where pixels have a heart and a soul. This research reveals the wonder of image creation, bringing pixels to life and pushing the limits of imagination. The authors set out to turn your visual ideas into reality by producing wonderfully high-resolution images with unparalleled artistic flexibility. The study is a collaboration between the Hong Kong University of Science and Technology, the Chinese Academy of Sciences, and Tencent AI Lab.

In their research, they investigate the potential of pre-trained diffusion models to generate images at resolutions beyond those of the training images, and to produce images with a variety of aspect ratios. When a pre-trained Stable Diffusion model trained on 512×512 images is used to generate images directly at a higher resolution, such as 1024×1024, persistent issues appear, such as repeated objects and unnatural object structures.
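
To see this failure mode for yourself, the snippet below is a minimal sketch using the Hugging Face `diffusers` library (the model id, prompt, and file name are illustrative choices, not taken from the paper): it simply asks a 512×512-trained Stable Diffusion checkpoint to sample directly at 1024×1024, the setting where the repeated objects appear.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a checkpoint trained at 512x512 (example model id, assumed available).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# Ask for 1024x1024 directly, i.e. 2x the training resolution per side.
image = pipe(
    "a photo of an astronaut riding a horse on the beach",
    height=1024,
    width=1024,
    num_inference_steps=50,
).images[0]

# The result typically shows duplicated subjects and distorted structures.
image.save("direct_1024.png")
```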

Existing methods for creating higher-resolution images, such as those based on attention scaling and joint diffusion, struggle to overcome these issues. The authors examine the structural components of the U-Net in diffusion models from a new angle and identify the fundamental cause as the limited receptive field of the convolutional kernels. Building on this finding, they present re-dilation, a simple yet effective method.

During image generation, this approach dynamically enlarges the convolutional receptive field. Furthermore, they present dispersed convolution and noise-damped classifier-free guidance, which make it possible to generate ultra-high-resolution images, such as 4096×4096, without additional training or optimization.

Importantly, their method successfully reduces the repetition problem and delivers excellent results in higher-resolution image synthesis, particularly in preserving fine texture details. Their findings also mean that pre-trained diffusion models trained on low-resolution images can be used directly for high-resolution generation, which provides important insights for future work on ultra-high-resolution image and video synthesis.

Prior related works and their limitations

Text-to-image synthesis has recently received a lot of interest due to its outstanding generation capabilities. Diffusion models are preferred among the numerous generative models because of their ability to generate high-quality images. Denoising diffusion probabilistic models (DDPM) initially laid the groundwork for diffusion models in image generation, and subsequent research has built on that basis. Because of their efficient use of a compact latent space, latent diffusion models (LDM) have grown in popularity.

Furthermore, Stable Diffusion (SD) models, an extension of LDMs, have been open-sourced, offering great sample quality and creativity. Despite their outstanding synthesis capabilities, the resulting images have a critical limitation: they are restricted to the resolution of the training data. For example, SD 2.1 has a resolution limit of 512×512 and SD XL of 1024×1024, making it difficult to produce higher-resolution images.

The first row shows re-dilation in high-resolution images

High-Resolution Synthesis and Adaptation

Because of the complexity of dealing with higher-dimensional data and the large computational resources required, creating high-resolution images poses significant obstacles. Previous efforts can be divided into two categories: training from scratch and fine-tuning. A recent work introduced a training-free strategy for variable-sized adaptation; however, it falls short for higher-resolution generation.

To avoid inconsistencies between image regions, MultiDiffusion and SyncDiffusion have concentrated on smoothing the overlap region. Both approaches, however, struggle with object repetition in their outputs. MultiDiffusion can reduce duplication by using user-supplied conditions such as regions and text, but these extra inputs are not readily available in the ordinary text-to-image setting.

Introduction to high-resolution image generation model

There has been a remarkable rise in image synthesis over the last two years, capturing the attention of both academia and industry. Text-to-image generation models such as Stable Diffusion (SD), SD XL, Midjourney, and IF have gained popularity. However, these models have limitations: the largest resolution they can manage is 1024×1024, which is insufficient for applications such as advertising.

If you try to generate images at resolutions greater than those used to train these models, you will encounter difficulties such as repeated objects and unnatural structures. When using an SD model trained on 512×512 images to create 512×1024 or 1024×1024 images, for example, you’ll observe that object repetition becomes more obvious as the image size grows.

Structure repetition issue of high-resolution image generation

The researchers examined images generated at various resolutions to study the repetition pattern and found that, while images at higher resolutions do not become blurry, their object structures degrade. This suggests that pre-trained SD models might be able to produce higher-resolution images without sacrificing image quality, if the structural problem can be fixed.

The researchers looked more closely at two structural elements of SD: convolution and self-attention. Surprisingly, they discovered that by switching from regular convolution to dilated convolution throughout the full U-Net with pre-trained parameters, they could minimize object recurrence, although some repetition remained in local edges. They performed additional analysis, taking into account U-Net blocks, timesteps, and dilation radius, to determine when and how to use dilated convolution.
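
As a rough illustration of that probing experiment, the sketch below (assuming PyTorch and a diffusers-style U-Net; the dilation factor of 2 is an illustrative choice, not the paper's exact recipe) swaps every pre-trained 3×3 convolution in the U-Net to a dilated convolution while reusing the same weights.

```python
import torch.nn as nn

def dilate_convs(unet: nn.Module, factor: int = 2) -> None:
    """Apply every pre-trained 3x3 convolution with a larger dilation."""
    for module in unet.modules():
        if isinstance(module, nn.Conv2d) and module.kernel_size == (3, 3):
            # Keep the pre-trained kernel; only change how it is applied.
            module.dilation = (factor, factor)
            # Grow the padding so the output keeps its spatial size.
            module.padding = (factor, factor)

# Example usage (assuming `pipe` is a diffusers StableDiffusionPipeline):
# dilate_convs(pipe.unet, factor=2)
# ...then sample at the higher resolution as usual.
```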

After this analysis, the researchers developed a tuning-free dynamic re-dilation technique to address the repetition problem. To enable the creation of ultra-high-resolution images, they also put forward additional strategies, namely dispersed convolution and noise-damped classifier-free guidance.

Our method can generate 4096 × 4096 images, 16× higher than the training resolution

Future scope of high-resolution image generation

This study’s future potential lies in several directions. First, increasing the effectiveness of adapting already trained models to even higher resolutions opens up a wider range of applications. Second, investigating how this strategy might be used to handle specific problems, such as higher video resolutions and real-time generation.

Moreover, the approach can be explored in areas beyond text-to-image and video synthesis. Finally, improving the fine-tuning process and investigating ways to reduce any remaining artifacts in extremely high-resolution synthesis will help advance the field of generative models and AI applications.

In-depth Research study and code availability

The research paper for this study is available on arXiv, and the implementation code is freely available on GitHub. Both resources are open source and accessible to the public.

Potential applications

There are several potential uses across industries for this research on adapting pre-trained models to produce higher-resolution images and videos. The technology can revolutionize content creation in the entertainment and media industries by enabling ultra-high-resolution videos and images, improving the visual appeal of films, video games, and virtual reality experiences.

Additionally, in the advertising sector, the ability to swiftly develop high-resolution, customized visual content can result in more compelling and effective marketing campaigns. In medical imaging, the technology can help produce high-quality scans and diagnostic images, assisting in the early identification and precise diagnosis of a variety of illnesses.

Qualitative ablation results

Additionally, the creation of ultra-high-resolution photographs can enable accurate simulations and virtual tours of architectural projects, optimizing the planning and presentation of ideas. The use of remote sensing and satellite imaging further expands applications, allowing for the creation of better, more accurate Earth observations with significant consequences for climatic monitoring, disaster management, and urban planning.

In conclusion, pre-trained models’ adaptability to produce higher-resolution content has enormous potential to change entire industries and raise the caliber of visual media, research, and applications.

Methods used in high-resolution image generation

The paper is organized around the problem formulation and motivation, followed by three techniques: re-dilation, convolution dispersion, and noise-damped classifier-free guidance. These methods are discussed in detail below.

Problem Formulation and motivation

In this work, they address the difficulty of adapting diffusion models to produce higher-resolution images without additional training. The goal is to take a base diffusion model trained on fixed low-resolution images and use it to synthesize higher-resolution images.

This issue has previously been addressed by scaling features in the self-attention layer according to the input resolution. However, that method did not fully resolve object recurrence in the generated 1024×1024 images. The authors observed that the local structure of the repeated objects was reasonable and that the main issue was the growth in the number of repeated objects as resolution increased. They therefore looked at whether the receptive field of some network component was inadequate for higher resolutions.
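
As a back-of-the-envelope check of this receptive-field argument (the layer count and resolutions below are illustrative, not figures from the paper): a stack of 3×3 convolutions covers a fixed number of pixels, so the fraction of the image it sees shrinks as the resolution grows, and dilation restores it.

```python
def receptive_field(num_layers: int, kernel: int = 3, dilation: int = 1) -> int:
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1) * dilation

n = 10  # hypothetical depth of one U-Net stage
print(receptive_field(n) / 512)                # ~0.041: fraction of a 512px image covered
print(receptive_field(n) / 1024)               # ~0.021: the same kernels cover half as much at 1024px
print(receptive_field(n, dilation=2) / 1024)   # ~0.040: dilation 2 roughly restores the ratio
```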

Left: samples obtained by increasing the perception field in middle blocks versus most blocks. Right: the first row shows the predicted original sample using noise-damped classifier-free guidance.

Re-dilation

Re-dilation is the method they introduced to deal with these problems. Re-dilation aims to make the network’s receptive field when generating at a higher resolution match the receptive field it had at the original, lower training resolution. For the self-attention layers, this involves splitting the feature map into dilated slices and feeding these slices simultaneously into the QKV attention.

Although the results were initially encouraging, they found that simply maintaining the receptive field of attention did not lead to appreciable improvements. Instead, by extending the receptive field of convolution in all U-Net blocks, they were able to correct the number of objects, though some artifacts remained. Their follow-up studies led to a more thorough re-dilation technique that takes into account when, where, and how to apply dilated convolution.
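
The wrapper below is a simplified sketch of re-dilation at inference time, not the authors’ exact implementation: it applies a frozen, pre-trained convolution with a dilation that tracks the target-resolution scale. Which blocks and timesteps to wrap is precisely the design choice the paper analyses, and is left to the caller here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReDilatedConv(nn.Module):
    """Apply a frozen pre-trained Conv2d with an enlarged dilation."""

    def __init__(self, conv: nn.Conv2d, scale: int = 2):
        super().__init__()
        self.conv = conv      # pre-trained convolution, weights unchanged
        self.scale = scale    # e.g. 2 when sampling at 2x the training size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        d = self.scale
        k = self.conv.kernel_size[0]
        pad = d * (k - 1) // 2            # keep the feature map spatially aligned
        return F.conv2d(x, self.conv.weight, self.conv.bias,
                        stride=self.conv.stride, padding=pad,
                        dilation=d, groups=self.conv.groups)
```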

Convolution dispersion

When adapting a diffusion model to a substantially higher resolution, re-dilated convolution ran into the issue of periodic sub-sampling, which led to artifacts. They proposed convolution dispersion as a solution. This technique expands the receptive field of a pre-trained convolution layer by spreading out its convolution kernel. The kernel is enlarged while preserving its original behavior using structure-level and pixel-level calibration. As a result, a substantially wider receptive field is achieved without the periodic sub-sampling issue.
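
The snippet below only illustrates the underlying idea of spreading a pre-trained 3×3 kernel over a larger support; the bilinear interpolation and L1 rescaling are simple stand-ins, not the structure- and pixel-level calibration the paper actually uses to derive the enlarged kernel.

```python
import torch
import torch.nn.functional as F

def disperse_kernel(weight: torch.Tensor, new_size: int = 7) -> torch.Tensor:
    """Spread a pre-trained (out_ch, in_ch, 3, 3) kernel onto a larger grid."""
    # Interpolate the kernel to the larger spatial support.
    big = F.interpolate(weight, size=(new_size, new_size),
                        mode="bilinear", align_corners=True)
    # Rescale each filter so its overall response magnitude stays comparable.
    scale = weight.abs().sum(dim=(-1, -2), keepdim=True) / \
            big.abs().sum(dim=(-1, -2), keepdim=True).clamp(min=1e-8)
    return big * scale
```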

Noise-Damped Classifier-Free Guidance

To sample at much higher resolutions, the outer blocks of the denoising U-Net also had to enlarge their receptive field, which hurt the model’s denoising ability. They proposed noise-damped classifier-free guidance to address this. The method combines two model priors: one with strong denoising capability and another that builds the image’s content structure by using re-dilated or dispersed convolution in most blocks. These noise estimates are combined linearly with a guidance scale during sampling. This guarantees effective denoising while producing correct object structures.
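
In code, the blend described above reduces to the familiar classifier-free-guidance formula. The sketch below assumes the unconditional estimate comes from the prior with strong denoising and the conditional estimate from the re-dilated/dispersed prior; the paper’s exact assignment of the two priors should be checked against the original.

```python
import torch

def noise_damped_cfg(eps_denoise_uncond: torch.Tensor,
                     eps_structure_cond: torch.Tensor,
                     guidance_scale: float = 7.5) -> torch.Tensor:
    """Linear combination of the two noise estimates with a guidance scale."""
    return eps_denoise_uncond + guidance_scale * (eps_structure_cond - eps_denoise_uncond)
```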

Visual comparisons with SD-SR

Experiments and evaluation of the model

In their experiments, they evaluated text-to-image models, specifically three versions of Stable Diffusion: SD 1.5, SD 2.1, and SD XL 1.0, at four unseen higher resolutions. The tested resolutions range from four to sixteen times the training pixel count. For instance, for models trained on 512×512 images they tested resolutions such as 1024×1024, 1280×1280, 2048×1024, and 2048×2048. They also applied their method to a text-to-video model at twice the training resolution.

They compared their approach with a tuning-free baseline as well as standard text-to-image diffusion models. Their results consistently surpassed the baselines, showing that their approach better preserves the original generation capability of pre-trained diffusion models. Visual comparisons demonstrated that their method produced more plausible structures and remarkably realistic textures.

Visual comparisons between ours, direct SD inference, and Attn-SF in 4×, 8×, and 16× settings across three Stable Diffusion models.

They applied their technique to the pre-trained text-to-video model LVDM to determine how well it generalizes to video generation models. Quantitative results using measures such as Fréchet Video Distance (FVD) and Kernel Video Distance (KVD) showed that their method successfully produces higher-resolution videos without degrading image quality.

Conclusion

The authors study sampling images at resolutions far higher than the training resolution of pre-trained diffusion models. Directly sampling a higher-resolution image maintains image definition but suffers from a serious object-repetition problem. They dig into the SD U-Net’s architecture and investigate the receptive field of its components.

They find that the key to sampling higher-resolution images is enlarging the convolutional receptive field rather than retraining the model. They then propose a tuning-free dynamic re-dilation technique to eliminate the repetition. They also propose dispersed convolution and noise-damped classifier-free guidance for ultra-high-resolution generation. Evaluations show how well their techniques work with various text-to-image and text-to-video models.

References

Paper: https://arxiv.org/pdf/2310.07702v1.pdf

Code: https://github.com/YingqingHe/ScaleCrafter

