This repository showcases the results of an AI image and video generation project using Stable Diffusion. The project uses the Stable Diffusion WebUI for prompt engineering, ControlNet, Dreambooth, and LoRA generative AI models. The Dreambooth and LoRA models were trained on a custom dataset, and the generated content is included in this repository.
- Stable Diffusion Introduction and WebUI Installation
- Text to Image
- Image to Image
- How to Write Prompts
- ControlNet Variants
- Dreambooth and LoRA Model Training
- Video Generation with Deforum
- Animating Real-Person Videos with Mov2mov
- Video Generation with Animatediff
- Conclusion
Stable Diffusion is a generative AI technique that involves the controlled diffusion of information throughout a system. Diffusion models are probabilistic models that describe how data, in this case, an image, changes or diffuses over time. The stable diffusion approach aims to create high-quality and diverse images by iteratively applying controlled diffusion processes to an initial image.
In the context of AI and creative applications, stable diffusion is often used to generate visually appealing and novel artworks. By manipulating the diffusion process through prompts and input parameters, users can guide the AI in creating unique and imaginative images. The stability in diffusion refers to the controlled and coherent evolution of the image during the generation process.
The technique is versatile, allowing users to explore a wide range of creative possibilities by influencing factors such as lighting, style, environment, and more. It is particularly popular in the field of generative art, where artists and AI enthusiasts leverage stable diffusion to produce captivating and diverse visual content.

-
Hardware Requirements
- Processor (CPU): Apple Silicon (M1 or M2). Recommended chips include the M1, M1 Pro, M1 Max, M2, M2 Pro, and M2 Max. Both efficiency and performance cores are used.
- Memory (RAM): Ideally, your machine should have 16 GB of memory or more.
- Performance Comparison: Stable Diffusion runs slower on Mac. A similarly priced Windows PC with a dedicated GPU is expected to deliver images faster.
-
System Requirements
- You need an Apple Silicon Mac (M1 or M2) with at least 8 GB of RAM, running macOS 12.3 or later. To check your version, click the Apple icon in the top left and select About This Mac; update macOS first if necessary.
-
Install AUTOMATIC1111 on Mac
-
Creating an Anaconda Virtual Environment
conda create -n YourEnvName
conda activate YourEnvName
Clone the AUTOMATIC1111 repository; a new folder stable-diffusion-webui will be created under your home directory.
git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui
cd ~/stable-diffusion-webui; ./webui.sh --no-half
Open a web browser and navigate to the following URL to start using Stable Diffusion.
http://127.0.0.1:7860/
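If you additionally launch the WebUI with the --api flag, images can also be generated programmatically over HTTP. Below is a minimal sketch using Python's requests library; the endpoint and payload fields follow the AUTOMATIC1111 API as I understand it and may vary between versions.

```python
# Minimal sketch: generating an image through the WebUI's HTTP API.
# Assumes the server was started with the --api flag.
import base64
import requests

payload = {
    "prompt": "a cat wearing a striped t-shirt, high quality",
    "negative_prompt": "worst quality, blurry",
    "steps": 30,
    "cfg_scale": 7,
    "width": 512,
    "height": 512,
    "seed": -1,
}

response = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload)
response.raise_for_status()

# The API returns generated images as base64-encoded PNG strings.
image_b64 = response.json()["images"][0]
with open("output.png", "wb") as f:
    f.write(base64.b64decode(image_b64))
```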
-
- Version 1. AUTOMATIC1111
- Version 2. digiclau korean ver.
The Stable Diffusion model (checkpoint) is the entity responsible for creating images, much like an artist drawing in the space the Stable Diffusion WebUI provides. Choosing a model is therefore comparable to choosing the artist. Techniques such as LoRA, embeddings, and hypernetworks help shape image generation, but the checkpoint is the artist itself; without it, there is no one to create the artwork, so having the right checkpoint is essential.
Just as the style of an artwork varies with who draws it, the generated images differ significantly depending on the chosen model. There are two main websites where you can download models.
-
Civitai : https://civitai.com
-
Hugging face : https://huggingface.co
Model files with the extensions .safetensors and .ckpt are the two common checkpoint formats used with the Stable Diffusion WebUI:
-
safetensors : A format that stores only the model's tensors (weights and other numerical parameters). Because a .safetensors file contains no executable code, loading it cannot run arbitrary code, which makes it the safer format to download and share.
-
ckpt : A checkpoint file containing the saved weights and biases of the model. It allows the model to be saved and restored later, so users can continue training or deploy the model without starting from scratch. Because .ckpt files are pickle-based PyTorch checkpoints, they can in principle execute code when loaded, so only use files from sources you trust.
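As a small illustration (the file names below are placeholders), the two formats can be inspected from Python with the safetensors and torch packages:

```python
# Sketch: inspecting the two checkpoint formats from Python.
# .safetensors stores only tensors, so loading it cannot execute code;
# .ckpt is a pickle-based PyTorch checkpoint and should only be loaded
# from sources you trust.
import torch
from safetensors.torch import load_file

safe_weights = load_file("realisticVisionV60B1.safetensors")  # dict[str, Tensor]
ckpt = torch.load("model.ckpt", map_location="cpu")           # arbitrary pickled objects

print(len(safe_weights), "tensors in the safetensors file")
print(type(ckpt))  # usually a dict containing a 'state_dict' key
```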

-
Parameters
- Sampling Steps : The number of denoising steps performed during sampling. A higher value may produce more detailed and refined images but requires more computation.
- Sampling Method : Refers to the technique used for sampling during image generation. Different methods can influence the diversity and quality of generated images.
- CFG Scale : Short for classifier-free guidance scale. It controls how strongly the generated image follows the prompt; higher values adhere to the prompt more closely, while lower values allow more creative deviation.
- Batch Size : The number of samples processed in one iteration. A larger batch size can speed up training but requires more memory.
- Size (Width, Height) : Defines the dimensions of the generated images. Specified as width and height, this setting determines the resolution and aspect ratio of the output images.
- Seed : A seed is an initial value used to start the randomization process. Setting a seed allows for reproducibility, ensuring that running the model with the same seed produces the same results.
Using Stable Diffusion for generating cosmetic model images through txt2img is an innovative and practical approach. It reflects the potential of AI-generated images for advertising and marketing purposes without relying on actual models.
- Sampling Steps: 30
- Sampling method: DPM++ 2M Karras
- CFG scale: 4
- Size: 512x720
- Model: Realistic_Vision_V5.1_fp16-no-ema
- VAE: vae-ft-ema-560000-ema-pruned.ckpt
Prompt : (realistic, photo-realistic:1.37), professional lighting, photon mapping, radiosity, 1girl, smile,
(holding a perfume:1.3),perfume, (medium shot), (looking at viewer:1), high quality, highres,
8k, accurate color reproduction, realistic texture,((simple background, white background)),
((wearing turtleneck sweater)), (extra deatailed), (best quality), 1girl, ((extra deatailed body:1.3)),
(realistic), narrow waist, (straight hair, medium-length hair, black hair, partedhairs:1.45), breasts,
pale skin, (realistic glistening skin:1.2), skindentation, masterpiece, (proportional eyes, same size eyes),
<lora:jwy___v1:1>
Negative prompt: 7dirtywords, easynegative, (nudity:1.3), nsfw, (worst quality:2.0), bad-artist, bad-hands-5
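For reference, here is a rough diffusers-based sketch of the same settings outside the WebUI. The checkpoint id is a placeholder for a Realistic Vision model on Hugging Face, the custom <lora:jwy___v1:1> token is WebUI-specific syntax and is omitted, and DPM++ 2M Karras is approximated with the multistep DPM-Solver scheduler using Karras sigmas.

```python
# Rough diffusers equivalent of the txt2img settings above (not the exact WebUI pipeline).
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V5.1_noVAE",  # placeholder checkpoint id
    torch_dtype=torch.float16,
).to("mps")  # "cuda" on an NVIDIA GPU, "cpu" otherwise

# DPM++ 2M Karras corresponds to the multistep DPM-Solver with Karras sigmas.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True
)

image = pipe(
    prompt="professional lighting, 1girl, smile, holding a perfume, ...",
    negative_prompt="nsfw, worst quality, bad-hands-5",
    num_inference_steps=30,   # Sampling Steps
    guidance_scale=4,         # CFG scale
    width=512, height=720,    # Size
    generator=torch.Generator("cpu").manual_seed(1234),  # Seed
).images[0]
image.save("cosmetic_model.png")
```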
In Stable Diffusion, Img2img allows the generation of new images from existing ones. This concept is highly versatile, enabling the creation of images in various styles. Whether transforming realistic images into animated styles or generating images in different artistic expressions, Img2img provides a powerful tool for diverse and creative image synthesis. This feature is particularly advantageous for artists, designers, and content creators seeking flexibility and creative freedom in their image generation process.
I used the realisticVisionV60B1 and revAnimated models to generate images in various styles.
- Results with the Inpaint feature applied

In Stable Diffusion WebUI, the Inpaint feature is a powerful tool that allows the transformation of images using the Img2img model. Specifically, it can be employed to seamlessly replace or fill in missing parts of an image, enhancing its visual appeal. This functionality becomes particularly useful in scenarios where image editing or manipulation is required, such as removing unwanted objects, correcting imperfections, or, as in this case, turning a striped shirt into a cat wearing a striped T-shirt.
A prompt is a concise and specific input provided to a generative model to guide its content creation. It typically consists of a brief textual description or set of instructions that influences the output of the model. In the context of Stable Diffusion webui, prompts are used to shape the characteristics, style, or subject matter of the generated images or videos.
Stable Diffusion webui has two input fields for prompts. The first is called the positive prompt, and the second is called the negative prompt. In the positive prompt, you include the content you want reflected in the generated images, while in the negative prompt, you include the content you prefer not to be reflected. However, it's important to note that not everything included in the prompts will be entirely reflected or excluded based on positive or negative prompts.
A token, in simple terms, is a unit of text, typically a word or word fragment. In prompt writing, the token count is simply the number of these units. The prompt input field in the upper right corner of the Stable Diffusion WebUI displays the token count. It is recommended to keep prompts within 75 tokens, as the prompt is processed in chunks of 75 tokens each; staying within a single chunk ensures the whole prompt is interpreted together.
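To see roughly how the counter works, a prompt can be tokenized with the CLIP tokenizer family used by Stable Diffusion v1.x models (a sketch; the WebUI's count may differ slightly):

```python
# Counting prompt tokens with the CLIP tokenizer used by SD v1.x models.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
prompt = "high quality, 8k, best quality, masterpiece, proportional eyes"

token_ids = tokenizer(prompt).input_ids
# Subtract the start/end special tokens to approximate the WebUI count.
print(len(token_ids) - 2, "tokens")
```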
Weight, in simple terms, refers to the influence or impact of a prompt. A prompt without a specified weight is assigned a default weight of 1. You can increase the influence of a prompt by assigning a weight, and there are two ways to do so:
-
Enclose in Parentheses : You can enclose the prompt in parentheses to give it additional weight. For example: (best quality)
-
Colon Notation : Alternatively, you can use colon notation to explicitly specify the weight. For example: (best quality:1.5)
In this case, a prompt with a weight has a greater impact compared to a prompt without any weight. However, it's essential not to set excessively high weights for a single prompt, as it may negatively affect the generated image. It is generally recommended to set weights within the range of 0.8 to 1.5 to maintain a balance and avoid potential image degradation.
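As a small illustration of the arithmetic (based on AUTOMATIC1111's documented behavior, as I understand it, of multiplying attention by 1.1 per parenthesis level):

```python
# Effective prompt weights in AUTOMATIC1111 syntax: each nesting level of
# parentheses multiplies the weight by 1.1, while (prompt:w) sets it explicitly.
def paren_weight(levels: int) -> float:
    return 1.1 ** levels

print(paren_weight(1))   # (best quality)     -> 1.1
print(paren_weight(2))   # ((best quality))   -> 1.21
explicit = 1.5           # (best quality:1.5) -> 1.5
print(explicit)
```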
-
Sentence Type Prompts : Sentence type prompts are structured phrases or sentences that provide detailed descriptions in a sentence or clause format. They are ideal for expressing elements like composition, scenario, or actions. Examples include prompts describing appearance, state, background, etc.
- Example of Sentence Type Prompt: (standing on the table),(looking at window)
-
Tag Type Prompts : Tag type prompts consist of single-word prompts that act as concise tags representing specific attributes such as appearance, state, or background. They are more focused and efficient for conveying certain aspects of the desired image.
- Example of Tag Type Prompt: (black_hair),(white_background)
- Prompts
high quality, 8k, best quality, accurate color reproduction, masterpiece, proportional eyes, same size eyes,
detailed body,radiosity, realistic, photo-realistic, sharp details, vibrant colors, crystal clear,
stunning clarity, vivid texture, lifelike rendering, optimal lighting, fine details, rich shadows
- Negative Prompts
7dirtywords, easynegative, worst quality, low quality, extra fingers, fewer fingers,missing fingers,
extra arms, inaccurate eyes, ugly, deformed, noisy, blurry,low contrast, distorted proportions,
unrealistic colors, pixelated, dull appearance, unnatural lighting,jagged edges, inconsistent shadows

ControlNet is introduced as a neural network structure designed to augment pretrained large diffusion models, such as Stable Diffusion, by incorporating additional input conditions. The primary purpose of ControlNet is to learn task-specific conditions in an end-to-end manner. Remarkably, the learning process remains robust even when the training dataset is limited, with effectiveness demonstrated even with datasets smaller than 50,000 samples.
ControlNet offers the advantage of efficient training, comparable in speed to fine-tuning a diffusion model. Notably, this training can be performed on personal devices, making it accessible for a broader range of users. Alternatively, if powerful computation clusters are available, ControlNet has the capacity to scale to large datasets, ranging from millions to billions of data points.
The integration of ControlNet with large diffusion models, exemplified by Stable Diffusion, enables the introduction of conditional inputs like edge maps, segmentation maps, keypoints, and more. This capability enriches the methods to control large diffusion models, opening avenues for enhanced control and customization in various applications related to image generation and manipulation.
Combining ControlNet models allows for the generation of more customized images based on specific conditions. For instance, by running the OpenPose preprocessor on an original image, one can generate a new image whose pose matches that of the person in the original. This showcases the capability of ControlNet models to leverage different input conditions to create tailored images.
-
Released Checkpoints
The initial release of ControlNet came with the following checkpoints.
- Canny edge : A monochrome image with white edges on a black background
- Depth : A grayscale image with black representing deep areas and white representing shallow areas
- Openpose : An OpenPose bone (keypoint) image
- Semantic Segmentation Map : A segmentation map following the ADE20K protocol
- Lineart : Lineart typically refers to the lines that outline the shapes and forms in an image, often used in illustrations or drawings
- Softedge : Soft edges generally imply smooth transitions between different regions in an image, as opposed to sharp or well-defined edges
These six ControlNet models are the ones most commonly used in practice. You can download their checkpoint files from Hugging Face.
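Outside the WebUI, the same conditioning idea can be sketched with the diffusers library. The example below uses the Canny checkpoint; the model ids and input file name are the commonly used public ones rather than anything specific to this project.

```python
# Hedged sketch of ControlNet conditioning with diffusers (Canny example).
import cv2
import numpy as np
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image
from PIL import Image

# 1. Turn the reference photo into a Canny edge map (white edges on black).
image = np.array(load_image("reference.jpg"))
edges = cv2.Canny(image, 100, 200)
edge_map = Image.fromarray(np.stack([edges] * 3, axis=-1))

# 2. Load the ControlNet checkpoint and attach it to a base SD 1.5 model.
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet
)

# 3. The edge map constrains the composition; the prompt controls the style.
result = pipe(
    "a robot standing in a neon-lit street, high quality",
    image=edge_map,
    num_inference_steps=30,
).images[0]
result.save("controlnet_canny.png")
```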
-
ReActor
The ReActor Face Swapping Extension in Stable Diffusion is introduced as a robust tool intended to address the absence of Roop. This extension facilitates lifelike face swaps within the Stable Diffusion framework. The comprehensive guide provides instructions on downloading and using the ReActor extension, offering users the capability to achieve realistic face-swapping effects. Additional details and resources can be accessed on the official ReActor GitHub page.
- High-Resolution Face Swaps with Upscaling
- Efficient CPU Performance
- Compatibility Across SDXL and 1.5 Models
- Automatic Gender and Age Detection
- No NSFW Filter (Uncensored)
- Continuous Development and Updates
In summary, the ReActor Extension stands out for high-resolution face swaps with advanced upscaling, optimized for CPU-only setups, offering versatility across SDXL and 1.5 models. It simplifies face-swapping with automatic gender and age detection, supports uncensored swaps, and excels in continuous development for evolving features and advancements in face-swapping technology.

Here is an example of a face swap in Stable Diffusion using Margot Robbie's face:
Source : https://ngwaifoong92.medium.com/introduction-to-controlnet-for-stable-diffusion-ea83e77f086e , https://www.nextdiffusion.ai/tutorials/how-to-face-swap-in-stable-diffusion-with-reactor-extension
- A method of adding new concepts to an already trained model
- Fine-tunes the weights of the entire model
- Creates a new checkpoint (weight) as the entire model is modified
- Occupies a significant amount of disk space, approximately 1-7GB
- High fidelity to visual features of the subject, preserving existing model knowledge even with fine-tuning using just a few images.

The Dreambooth model operates by taking a small set of input images, usually 3-5, depicting a specific subject, along with the corresponding class name (e.g., "dog"). It then produces a fine-tuned or personalized text-to-image model. This model encodes a distinctive identifier specific to the subject. During the inference stage, this unique identifier can be embedded in various sentences to generate synthesized images of the subject in different contexts.

The structure involves a two-step fine-tuning process using approximately 3-5 images of a subject:
(a) The initial step fine-tunes a low-resolution text-to-image model using input images paired with a text prompt containing a unique identifier and the class name of the subject (e.g., "A photo of a [T] dog"). Simultaneously, a class-specific prior preservation loss is applied. This loss leverages the model's semantic understanding of the class, encouraging the generation of diverse instances belonging to the subject's class by injecting the class name into the text prompt (e.g., "A photo of a dog").
(b) The subsequent step fine-tunes the super resolution components using pairs of low-resolution and high-resolution images derived from the input image set. This process enables the model to maintain high-fidelity to small details of the subject.
Source : https://dreambooth.github.io
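To make the two losses in step (a) concrete, here is a conceptual sketch of the objective; the function and argument names are illustrative, not an actual API, and it only loosely mirrors how the diffusers DreamBooth training script combines the two terms.

```python
# Conceptual sketch of the Dreambooth objective (not a full training loop):
# the usual denoising loss on the subject images is combined with a
# prior-preservation loss computed on generated images of the generic class,
# so the model keeps its notion of "dog" while learning "[T] dog".
import torch.nn.functional as F

def dreambooth_loss(unet, noisy_instance, noisy_class, noise_i, noise_c,
                    timesteps, instance_emb, class_emb, prior_weight=1.0):
    # Denoising loss on "a photo of a [T] dog" images.
    pred_i = unet(noisy_instance, timesteps, encoder_hidden_states=instance_emb).sample
    instance_loss = F.mse_loss(pred_i, noise_i)

    # Prior-preservation loss on "a photo of a dog" images produced by the
    # frozen, original model before fine-tuning.
    pred_c = unet(noisy_class, timesteps, encoder_hidden_states=class_emb).sample
    prior_loss = F.mse_loss(pred_c, noise_c)

    return instance_loss + prior_weight * prior_loss
```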
I fine-tuned the Dreambooth model on Colab using 20 pictures of our dog, Kami. The pre-trained model used for fine-tuning was realisticVisionV60B1.
-
Parameters
- Crop size : 512
- Unet_training_steps : 3300
- Unet_learning_rate : 2e-6
- Text_encoder_training_steps : 350
- Text_encoder_learning_rate : 1e-6
- Resolution : 512
- LoRA introduces subtle changes to the most critical part of the Stable Diffusion model, the cross-attention layer. The cross-attention layer is the point where images and prompts intersect, and even small changes can have significant effects.
- The modified parts are saved in a separate file and used in conjunction with the ckpt (base model) file.
- The file size ranges from 2 to 200 MB, relatively smaller compared to the Dreambooth model, and it exhibits decent learning capabilities.
- The reason the LoRA file is smaller, even though it effectively captures the same update, lies in its approach of decomposing a large matrix into two smaller low-rank matrices. In other words, LoRA stores significantly fewer numbers by representing the update as a product of two low-rank matrices.

The weights of the cross-attention layer are stored in a matrix. Essentially, a matrix is just an arrangement of numbers organized in rows and columns, similar to an Excel spreadsheet. The LoRA model fine-tunes itself by adding weights to this matrix.
- Weighted Sum

Let's assume a model has a matrix composed of 1000 rows and 2000 columns. In this case, the model file would store 2 million (1000x2000) numbers. LoRA, however, splits this matrix into a 1000x2 matrix and a 2x2000 matrix. This results in only 6000 numbers in total (1000x2 + 2x2000), reducing the size to 1/333 compared to the original matrix. That's why the LoRA file is much smaller.
In this example, the rank of the matrices stored by LoRA is 2, far smaller than the dimensions of the original 1000x2000 matrix. Such a reduced matrix is called a low-rank matrix. In practice, researchers have found that this low-rank approximation of the cross-attention update barely affects fine-tuning quality, which is why the approach works so well.
Source : LoRA paper , https://www.internetmap.kr/entry/How-to-LoRA-Model
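The matrix arithmetic above can be reproduced in a few lines of numpy (an illustration only, not the actual LoRA implementation):

```python
# The 1000x2000 example from above in numpy: storing the update as two
# low-rank factors instead of a full matrix.
import numpy as np

rank = 2
A = np.random.randn(1000, rank)   # 1000 x 2
B = np.random.randn(rank, 2000)   # 2 x 2000

delta_W = A @ B                   # reconstructed 1000 x 2000 update

full_params = 1000 * 2000         # 2,000,000 numbers
lora_params = A.size + B.size     # 6,000 numbers
print(full_params / lora_params)  # ~333x smaller

# At inference time the update is simply added to the frozen base weights,
# optionally scaled: W_new = W_base + scale * (A @ B)
```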
I fine-tuned the LoRA model using 20 pictures of Pikachu on Colab. The pre-trained model used for fine-tuning was Chillout-Mix.
-
Parameters
- Data annotation : BLIP Captioning (batch size 8, max_data_loader_n_workers 2)
- Datasets : resolution = 512, min_bucket_reso = 256, max_bucket_reso = 1024, caption_dropout_rate = 0, caption_tag_dropout_rate = 0, caption_dropout_every_n_epochs = 0, flip_aug = false , color_aug = false
- Optimizer_argument : optimizer_type = "AdamW", learning_rate = 0.0001, max_grad_norm = 1.0, lr_scheduler = "constant", lr_warmup_steps = 0
- Training_arguments : save_precision = "fp16", save_every_n_epochs = 5, train_batch_size = 3, max_token_length = 225, mem_eff_attn = false, xformers = true, max_train_epochs = 25, max_data_loader_n_workers = 8, persistent_data_loader_workers = true, gradient_checkpointing = false, gradient_accumulation_steps = 1, mixed_precision = "fp16", clip_skip = 2

Deforum is an open-source animation tool that leverages Stable Diffusion's image-to-image function to produce dynamic video content. The process involves generating a sequence of images and then stitching them together to create a coherent video.
The animation is achieved by applying slight transformations to each frame. The image-to-image function is utilized to generate subsequent frames, ensuring that the transitions between frames are minimal. This approach creates the illusion of smooth continuity, resulting in a visually pleasing and fluid video.
-
Parameters
-
Translation (x, y, z) : Translation represents movement of the view in three-dimensional space; translation x, y, and z denote movement along the x-, y-, and z-axes respectively.
-
Rotation (3d x, y, z) : Rotation controls the orientation of the view in three-dimensional space; rotation 3d x, y, and z represent rotation around the x-, y-, and z-axes respectively.
-
Noise Schedule : The noise schedule refers to a predefined plan or sequence for introducing noise during the generation process. It helps control the randomness or variability in the generated images or video frames. Adjusting the noise schedule can influence the level of detail, texture, or unpredictability in the final output.
You can find more detailed parameter settings on this website : https://stable-diffusion-art.com/deforum/
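To illustrate how such schedules are read, here is a rough approximation in Python. Deforum's actual parser is more capable (it also evaluates math expressions of the frame index t, as in the noise schedule of the Superman example below), so treat this only as a sketch of the keyframe-and-interpolation idea.

```python
# Illustrative approximation of reading a Deforum-style schedule string such
# as "0:(0), 30:(15), 210:(15), 300:(0)": keyframe/value pairs, with values
# interpolated for the frames in between.
import numpy as np

def parse_schedule(schedule: str, total_frames: int) -> np.ndarray:
    frames, values = [], []
    for part in schedule.split(","):
        frame, value = part.split(":")
        frames.append(int(frame.strip()))
        values.append(float(value.strip().strip("()")))
    # Linearly interpolate between the listed keyframes.
    return np.interp(np.arange(total_frames), frames, values)

translation_x = parse_schedule("0:(0), 30:(15), 210:(15), 300:(0)", 300)
print(translation_x[0], translation_x[15], translation_x[30])  # 0.0 7.5 15.0
```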
-
- Superman video generated with Deforum
-
Parameters
- Translation x: 0: (0), 30:(15), 210:(15), 300:(0)
- Translation z: 0: (0.2), 60:(10), 300:(15)
- Rotation 3d x: 0: (0), 60:(0), 90:(0.5), 179:(0), 180:(150), 300:(0.5)
- Rotation 3d y: 0: (0), 30:(-3.5), 90:(0.5), 180:(-2.8), 300:(-2), 420:(0)
- Rotation 3d z: 0: (0), 60:(0.2), 90:(0), 180:(-0.5), 300:(0), 420:(0.5), 500:(0.8)
- Noise schedule: 0: (-0.06*(cos(3.141*t/15)**100)+0.06)
- Prompts
"0": "Superman soaring through the sky, descending to rescue a person, vibrant colors in the background
with clouds and sunlight, Digital illustration, good quality, realistic",
"60": "Superman descending in an urban environment at night, city lights below creating a dramatic atmosphere,
a mix of tension and relief in the atmosphere, realistic, good quality",
"120": " Superman descending in a futuristic cityscape, surrounded by holographic displays and advanced technology,
neon lights and advanced architecture, realistic, good quality",
"180": "Superman descending in a natural setting with a serene landscape, mountains and clear blue sky,
stable diffusion capturing the peaceful yet powerful moment, realistic, good quality, 8k",
"220":"Superman swooping down towards a person in a chaotic battlefield, smoke and debris in the background,
realistic, good quality, 8k"
Video-to-video tasks are typically labor-intensive and time-consuming, demanding significant manual effort to achieve desired results. The Mov2mov extension, integrated into Stable Diffusion, revolutionizes this workflow by introducing automation to streamline and simplify the entire process. This extension significantly reduces the need for manual intervention, making video-to-video tasks more efficient and accessible to users.
- Step1 - Enter a Prompt
Enter the desired prompt and negative prompt for your video. You can use a detailed description or specific keywords to guide the video generation process.
- Step2 - Upload Video
Upload the video you wish to work with by dropping it onto the video canvas. Set the Resize mode to: "Crop and resize".
-
Step3 - Mov2mov Settings
When using the Mov2mov extension, here are some key settings to consider:
- Sampling method : Keep in mind that ancestral samplers (such as Euler a) add frame-to-frame randomness and may increase flickering; deterministic samplers like Euler, LMS, and DPM++ 2M Karras generally work better with this extension.
- Noise Multiplier : Utilize the slider to adjust the noise multiplier. For smoother results and reduced flickering, keep it at 0.
- CFG Scale : Control the extent to which the prompt is followed by adjusting the CFG scale. In the provided video, a scale of 7 was used.
- Denoising Strength : Fine-tune the amount of change applied to the video by adjusting the denoising strength. A value of 0.6 was used in the example video.
- Movie Frames : The frames per second of the output. Higher values produce a smoother video but require more frames to be rendered.
- Max Frame : Determine the total number of frames to be generated. For initial testing, set it to a low number such as 10. To generate a full-length video, set it to -1.
- Seed : The seed used for the first frame. Even if you set it to -1, a random seed is chosen once and then reused for every frame.
With all the settings in place, it's time to generate the video. Click the "Generate" button to start the process. Be patient as it may take some time. Once the generation is complete, your new video will appear on the right side of the page.
Click "Save" to download and save the video to your device. If you can't locate the video, check the output/mov2mov-videos folder.
Source : https://www.nextdiffusion.ai/tutorials/transforming-videos-into-stunning-ai-animations-with-stable-diffusion-mov2mov | Original Videos : https://youtube.com/shorts/4cT2swoyNAY?si=B_OwxpMP-D2msK7I , https://youtube.com/shorts/2yZRp7wcqKk?si=FWDIucIFD7xG5cRG
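Conceptually, Mov2mov automates a frame-by-frame img2img loop like the sketch below; stylize_frame is a placeholder for the Stable Diffusion img2img call, and the file names are illustrative.

```python
# What Mov2mov automates, in outline: split the video into frames, run each
# frame through img2img with the same prompt and seed, then stitch the
# results back together.
import cv2

def stylize_frame(frame):
    # Placeholder: in the real workflow this is a Stable Diffusion img2img
    # pass with a fixed seed and a moderate denoising strength (~0.6).
    return frame

cap = cv2.VideoCapture("input.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
writer = None

while True:
    ok, frame = cap.read()
    if not ok:
        break
    styled = stylize_frame(frame)
    if writer is None:
        h, w = styled.shape[:2]
        writer = cv2.VideoWriter("output.mp4",
                                 cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    writer.write(styled)

cap.release()
writer.release()
```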
AnimateDiff in Stable Diffusion is an extension that adds a pretrained motion module to an image-generation model, allowing it to produce short animated clips with coherent motion rather than a single still frame. The system generates a sequence of frames that stay visually consistent from one to the next, producing a smooth, dynamic result.
This extension is particularly useful for animating characters, expressions, or styles that a checkpoint can already draw. AnimateDiff expands the creative possibilities within Stable Diffusion, enabling users to generate captivating and seamless animations with ease.
-
AnimateDiff Settings
Once you've installed the AnimateDiff extension, it will appear at the bottom of the Stable Diffusion interface. To use it, click on the "AnimateDiff" panel and it should fold out. There are a number of settings you can configure; I will list the ones I recommend.
-
Troubleshooting
If you experience long generation times, press "Remove motion module from any memory" before generating. It also helps to keep the negative prompt under 75 characters and to use at most 16 frames.
-
Using Animatediff for generating dynamic and animated videos
- Model : ToonYou
- Sampling method: DPM++ 2M Karras
- Steps: 40
- Resolution: 512x512
- CFG Scale: 8
- Model : Realistic Vision v6.0 B1
- Sampling method: DPM++ 2M Karras
- Steps: 30
- Resolution: 512x512
- CFG Scale: 7
Source : https://github.com/guoyww/animatediff/?tab=readme-ov-file , https://www.nextdiffusion.ai/tutorials/how-to-make-gif-animations-with-stable-diffusion-animatediff
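For comparison, AnimateDiff can also be run outside the WebUI with the diffusers AnimateDiffPipeline. In this sketch the motion adapter id is the public SD 1.5 one and the base checkpoint id is a placeholder; scheduler settings are simplified.

```python
# Hedged sketch of AnimateDiff with diffusers (SD 1.5 motion adapter).
import torch
from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
from diffusers.utils import export_to_gif

adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
pipe = AnimateDiffPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V5.1_noVAE",  # placeholder base checkpoint
    motion_adapter=adapter,
    torch_dtype=torch.float16,
).to("cuda")  # use "mps" or "cpu" (with float32) if no NVIDIA GPU; much slower
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

frames = pipe(
    prompt="a girl smiling, photorealistic, best quality",
    negative_prompt="worst quality, low quality",
    num_inference_steps=30,
    guidance_scale=7,
    num_frames=16,
).frames[0]
export_to_gif(frames, "animation.gif")
```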
In conclusion, this project delved into the diverse capabilities offered by Stable Diffusion, showcasing its prowess in image and video generation. From the foundational aspects covered in the introduction to the advanced features explored in later sections, we've navigated through text-to-image synthesis, various image manipulations, and sophisticated video generation techniques.
The utilization of ControlNet variants has provided insights into enhancing the guidance and control mechanisms for image creation, demonstrating the flexibility and adaptability of Stable Diffusion. Additionally, the training of Dreambooth and LoRA models has unveiled the intricacies of fine-tuning and personalizing image generation, pushing the boundaries of what can be achieved.
The practical applications demonstrated through Deforum and Mov2mov showcase how Stable Diffusion can be integrated into real-world scenarios, from video generation to animating real-person videos. The AnimateDiff extension adds an extra layer of creativity, allowing for smooth, coherent animations.
As technology advances, Stable Diffusion stands as a powerful tool in the realm of AI-driven image and video generation. This project serves as a comprehensive guide, and the outlined procedures, models, and techniques open avenues for exploration and innovation. The rich functionalities offered by Stable Diffusion, combined with ControlNet variants and other extensions, contribute to a dynamic and evolving landscape in the field of AI-generated content.
Whether you are an AI enthusiast, researcher, or practitioner, the journey through Stable Diffusion presented here invites you to explore, experiment, and leverage these tools to create compelling and realistic visual content.
Thank you for reading this exploration of Stable Diffusion and its applications. The possibilities are vast, and we look forward to witnessing continued advancements and creative endeavors in the exciting field of AI image and video generation.