
Beyond Diffusion: What is Personalized Image Generation and How Can You Customize Image Synthesis?

Personalized Image Generation by Fine-Tuning Stable Diffusion Models


Figure 1: Personalized Diffusion In Action (Source: Author)

In this article, you will learn about customization and personalization of diffusion-model-based image generation. More specifically, you will learn about Textual Inversion and DreamBooth. The article builds upon the concepts of Autoencoders, Stable Diffusion models (SD), and Transformers, so if you would like to learn more about those concepts, feel free to check out my earlier posts on these topics.

Text-to-image generators based on diffusion models are among the major developments in deep learning and image generation. Such generators (e.g., Stable Diffusion models [1]) are robust and can accurately generate images of a wide range of concepts in a variety of backgrounds and contexts, which has opened up a whole new area of research and innovation. However, the generation process is uncontrolled and cannot be customized to personal taste: one must rely only on the concepts the network was originally trained on. For instance, it is not possible to prompt SD with a word or image from your own personal life (e.g., the unique name of your pet or cartoon character, or a photo of a personal toy) and then modify its pose or context.
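To make this limitation concrete, here is a minimal text-to-image sketch using the Hugging Face diffusers library (the checkpoint name and prompt are illustrative, not prescriptive). Every word in the prompt has to map onto a concept the model already learned during pre-training; a made-up name for your own pet or toy carries no meaning for the model.

```python
# Minimal sketch of plain text-to-image generation with a pre-trained
# Stable Diffusion checkpoint (model id and prompt are illustrative).
# The prompt can only reference concepts the model already knows.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("a photo of a cat wearing a space suit on the moon").images[0]
image.save("cat_astronaut.png")
```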

One way to achieve such customization is to fine-tune SD on the new images. However, fine-tuning the complete model on just a few images can have a drastic effect on what the model has learned: the network quickly tends to overfit to the new images and forget the diverse concept space on which it was originally trained. Therefore, a controlled fine-tuning mechanism is desired, one that allows injecting new concepts into the pre-trained model while preserving its previous learning.

Textual Inversion

A recently proposed technique termed Textual Inversion is one of the methods that achieve customized image generation by fine-tuning SD models [2]. It takes a few training instances (e.g., 3–5 images) along with a prompt containing a novel concept and learns to generate images of the new object or style. The key point is that Textual Inversion does not re-learn the whole parameter space of the SD network; it only learns the embedding vector of a new pseudo-word in the text encoder's embedding space, while the rest of the model remains frozen.
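The following sketch only illustrates that key point using the transformers library: a new pseudo-word is added to the tokenizer, and only its embedding row in the otherwise frozen text encoder is made trainable. The checkpoint name, placeholder token, and learning rate are illustrative, and the actual denoising loss computed over the 3–5 training images is omitted.

```python
# Sketch of the core Textual Inversion setup: add a pseudo-word to the
# tokenizer and train ONLY its embedding vector; all other weights stay frozen.
# Checkpoint, placeholder token, and hyperparameters are illustrative.
import torch
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

# 1. Register the new pseudo-word and grow the embedding table by one row.
placeholder = "<my-toy>"
tokenizer.add_tokens([placeholder])
text_encoder.resize_token_embeddings(len(tokenizer))
new_token_id = tokenizer.convert_tokens_to_ids(placeholder)

# 2. Initialize the new row from a coarse, related word (here "toy").
init_id = tokenizer.encode("toy", add_special_tokens=False)[0]
embeddings = text_encoder.get_input_embeddings().weight
with torch.no_grad():
    embeddings[new_token_id] = embeddings[init_id].clone()

# 3. Freeze the text encoder, then re-enable gradients only for the embedding
#    table; the optimizer therefore touches nothing but word embeddings.
#    (In a full training loop, all rows except new_token_id are typically
#    restored after each step, so only the new concept's vector really moves.)
text_encoder.requires_grad_(False)
text_encoder.get_input_embeddings().weight.requires_grad_(True)
optimizer = torch.optim.AdamW(
    text_encoder.get_input_embeddings().parameters(), lr=5e-4
)
```

Once such an embedding has been trained and saved, diffusers can attach it to a pipeline via pipe.load_textual_inversion(...), after which the pseudo-word can be used like any other token in a prompt.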
