In recent years, the use of diffusion models in text-to-image generation has yielded remarkable results. These models have significantly improved image quality and inference performance and expanded the creative possibilities in this field. However, managing the generation process effectively remains a challenge, particularly when it comes to conditions that are difficult to express in words.
To address this challenge, Google researchers have developed MediaPipe diffusion plugins, which enable controllable on-device text-to-image generation. In their latest study, the researchers build on their previous work on on-device GPU inference for large generative models, presenting low-cost solutions for controllable text-to-image generation. These solutions can be seamlessly integrated into existing diffusion models and their Low-Rank Adaptation (LoRA) variants.
The key idea behind diffusion models is iterative denoising for image generation. Each iteration starts with a noisy image and progressively refines it toward the target concept. Text prompts play a crucial role in this process by providing language understanding: the text embedding is injected into the text-to-image model through cross-attention layers. However, certain details, such as object position and pose, are difficult to convey through text prompts alone. To address this, the researchers introduce control information from a condition image into the diffusion process, leveraging additional models.
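The iterative refinement described above can be sketched as a toy loop. This is a deliberately simplified illustration, not Google's actual model: `predict_noise` is a hypothetical stand-in for a trained denoising network, and the `target` array stands in for the concept the prompt describes.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.full((8, 8), 0.5)   # stand-in for the target concept
x = rng.normal(size=(8, 8))     # start from pure Gaussian noise

def predict_noise(x, target):
    """Hypothetical stand-in for a denoising network: it simply
    estimates the 'noise' as the offset of x from the target."""
    return x - target

for step in range(50):
    eps = predict_noise(x, target)
    x = x - 0.1 * eps           # one refinement step toward the target

# After enough iterations, x is close to the target concept.
print(np.abs(x - target).max())
```

Each step removes a fraction of the estimated noise, so the image converges geometrically toward the target; a real diffusion model follows the same loop shape but with a learned noise predictor and a carefully scheduled step size.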
Various methods are commonly employed to generate controlled text-to-image output, including Plug-and-Play, ControlNet, and T2I Adapters. Plug-and-Play utilizes a copy of the diffusion model and a denoising diffusion implicit model (DDIM) inversion approach to encode the state from an input image and derive an initial noise input. The spatial features extracted from the copied diffusion model are then injected into the text-to-image diffusion. ControlNet, on the other hand, constructs a trainable duplicate of the encoder of a diffusion model, which encodes conditioning information and passes it to the decoder layers through a convolution layer.
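The ControlNet-style injection described above can be sketched in a few lines. This is an assumed simplification for illustration: features from the condition encoder are passed through a zero-initialized 1x1 convolution and added to the decoder features, so that at the start of training the base model's behavior is unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
decoder_feat = rng.normal(size=(16, 32, 32))  # base decoder features (C, H, W)
cond_feat = rng.normal(size=(16, 32, 32))     # features from the condition encoder

# Zero-initialized 1x1 convolution, weight shape (C_out, C_in)
zero_conv_w = np.zeros((16, 16))

def conv1x1(w, feat):
    # (C_out, C_in) @ (C_in, H*W) -> reshaped to (C_out, H, W)
    c, h, wd = feat.shape
    return (w @ feat.reshape(c, -1)).reshape(-1, h, wd)

out = decoder_feat + conv1x1(zero_conv_w, cond_feat)

# At initialization the injection is a no-op, preserving the base model.
print(np.allclose(out, decoder_feat))
```

The zero initialization is the key design choice: the conditioning branch starts as an identity pass-through and only gradually learns to steer the base model as its weights move away from zero.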
However, ControlNet’s size, roughly half that of the diffusion model itself, poses a limitation. The T2I Adapter delivers comparable controlled-generation results with a smaller network: it takes only the condition image as input and reuses the extracted features across all subsequent diffusion iterations. Nevertheless, this adapter style is not designed for mobile devices.
To address the limitations of existing methods, the Google researchers introduce the MediaPipe diffusion plugin, a standalone network that enhances conditioned generation in a more effective, flexible, and scalable manner. The plugin connects effortlessly to a trained baseline model, making it pluggable. Importantly, it does not use any weights from the original model and is instead trained from scratch. Furthermore, the plugin is portable: it runs independently on mobile devices with negligible additional cost.
The plugin network integrates seamlessly with an existing text-to-image model: the features extracted by the plugin are fed into the corresponding downsampling layers of the diffusion model. This portable on-device paradigm for text-to-image generation is available as a free download. The plugin model has only about 6 million parameters, keeping it lightweight. To ensure rapid inference on mobile devices, it builds on MobileNetV2, using depth-wise convolutions and inverted bottlenecks.
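The MobileNetV2-style building block mentioned above can be sketched as follows. The layer sizes and weights here are illustrative assumptions, not the plugin's real architecture: an inverted bottleneck expands the channels with a 1x1 convolution, applies a cheap per-channel depth-wise 3x3 convolution, then projects back down, with a residual connection around the whole block.

```python
import numpy as np

rng = np.random.default_rng(2)

def conv1x1(w, x):                     # w: (C_out, C_in), x: (C_in, H, W)
    c, h, wd = x.shape
    return (w @ x.reshape(c, -1)).reshape(-1, h, wd)

def depthwise3x3(k, x):                # k: (C, 3, 3), one 3x3 filter per channel
    c, h, wd = x.shape
    pad = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += k[:, i, j][:, None, None] * pad[:, i:i + h, j:j + wd]
    return out

def inverted_bottleneck(x, expand_w, dw_k, project_w):
    h = np.maximum(conv1x1(expand_w, x), 0)    # 1x1 expand + ReLU
    h = np.maximum(depthwise3x3(dw_k, h), 0)   # depth-wise 3x3 + ReLU
    return x + conv1x1(project_w, h)           # linear 1x1 project + residual

x = rng.normal(size=(8, 16, 16))               # (channels, height, width)
expand_w = rng.normal(size=(48, 8)) * 0.1      # 6x channel expansion
dw_k = rng.normal(size=(48, 3, 3)) * 0.1
project_w = rng.normal(size=(8, 48)) * 0.1

y = inverted_bottleneck(x, expand_w, dw_k, project_w)
print(y.shape)  # (8, 16, 16)
```

The depth-wise convolution applies one small filter per channel instead of mixing all channels, which is what keeps the parameter count and inference cost low enough for mobile devices.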
In summary, the introduction of MediaPipe diffusion plugins by Google researchers marks a significant advancement in on-device text-to-image generation. By addressing the challenges of effective generation control and providing low-cost solutions for controllable text-to-image creation, these plugins offer enhanced control and flexibility. With their easy integration into existing diffusion models, the plugins empower users to create high-quality images from text prompts on their mobile devices. The MediaPipe diffusion plugin represents an exciting development in the field of text-to-image generation and holds promise.