Fine Tuning FLUX: Personalize AI Image Models on Minimal Data for Custom Look and Feel
2024-08-15
Black Forest Labs (founded by Stability AI alumni) launched FLUX.1, a suite of AI image generation models, two of which are released with open weights that you can run locally.
FLUX models took social media by storm because the largest one, FLUX.1 [pro], outperformed Stable Diffusion 3 Ultra, Midjourney v6.0, and DALL·E 3 HD.
Look at the benchmarks, absolutely crazy!
There are 3 models:
- FLUX.1 [pro]: proprietary, API-based, $0.055/image.
- FLUX.1 [dev]: 12B parameters, non-commercial license.
- FLUX.1 [schnell]: 12B parameters, speed-optimized, Apache 2.0.
and all of them are based on hybrid multimodal transformer blocks:
- Parallel diffusion and parallel attention layers
- Scaled to 12B parameters
- Flow matching (consistently better performance than alternative diffusion-based methods in terms of both likelihood and sample quality)
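If you're curious what that means in practice, flow matching (in the rectified-flow form these models build on) can be sketched, much simplified, as training a velocity field along straight noise-to-data paths:

$$x_t = (1 - t)\,x_0 + t\,\epsilon, \qquad \mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{x_0,\ \epsilon \sim \mathcal{N}(0, I),\ t}\,\big\lVert v_\theta(x_t, t) - (\epsilon - x_0) \big\rVert^2$$

where $x_0$ is a training image (in latent space), $\epsilon$ is Gaussian noise, and $v_\theta$ learns the velocity that carries samples between the two; at inference you integrate this field from noise back to an image.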
There are lots of LoRAs and extensions being released as you read this article, people say it's noticeably better than Midjourney, and my own experience so far is very… very promising!
FLUX models democratize access to cutting-edge generative AI research, and push the limits of text-to-image synthesis.
Fun fact: Black Forest Labs CEO Robin Rombach also co-authored key papers on VQGAN, latent diffusion, adversarial diffusion distillation, Stable Diffusion XL, and Stable Video Diffusion.
We will see a wave of open fine-tuned versions of FLUX, which is why I wanted to walk you through FLUX LoRA fine-tuning locally.
Let’s GOOOOO!
Setting up the environment for Flux LoRA fine tuning
First things first, here’s what you should already have:
- Python 3.10 or newer
- An NVIDIA GPU with at least 24GB of VRAM to train FLUX.1 (I'm running this on an RTX 4090)
- python venv
- git
Once everything is in place, open your command line interface — this could be your Command Prompt, Terminal, or any other CLI tool you’re comfortable with — and run the following commands:
```bash
git clone https://github.com/ostris/ai-toolkit.git
cd ai-toolkit
git submodule update --init --recursive
python3 -m venv flux-finetune-env
source flux-finetune-env/bin/activate
pip3 install torch
pip3 install -r requirements.txt
```
You should now have the full ai-toolkit project checked out, with its dependencies installed in the virtual environment.
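Before going further, it's worth a quick sanity check that the PyTorch build you just installed actually sees your GPU and that you have the roughly 24GB of VRAM needed. This is just a throwaway snippet of mine, not part of the toolkit:

```python
import torch

# Confirm a CUDA-enabled torch build and enough VRAM for FLUX.1 LoRA training
print(torch.__version__)
print(torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
```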
Now, we will set up Hugging Face access:
- Log into HF and accept the model access terms for [black-forest-labs/FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev). Currently, we can only work with FLUX.1-dev, which means anything we train inherits its non-commercial license.
- Create a file named .env in the root of this folder.
- [Get a READ token from Hugging Face](https://huggingface.co/settings/tokens/new?)
and add it to the .env file:
HF_TOKEN=hf_jpTKpr....
and we are ready to go!
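If you want to double-check that the token is picked up, here's a tiny optional snippet (huggingface_hub comes in via the toolkit's dependencies; install python-dotenv with pip if it isn't present):

```python
import os

from dotenv import load_dotenv      # pip install python-dotenv if missing
from huggingface_hub import whoami

load_dotenv()  # reads HF_TOKEN from the .env file in the current directory
info = whoami(token=os.environ["HF_TOKEN"])
print(f"Authenticated as: {info['name']}")
```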
Dataset preparation for fine tuning FLUX models
Here's how we will prepare our dataset, according to the guidance shared in the repository:
- Folder Structure: The dataset will be organized in a single folder containing images and their corresponding text files.
- File Naming: Text files will have the same name as their corresponding image but with a .txt extension (e.g., image22.jpg and image22.txt).
- Supported Image Formats: Only JPG, JPEG, and PNG formats are supported. Avoid using WebP due to known issues.
- Text File Content: Each text file should contain only the caption for the corresponding image. The word [trigger] can be included in the caption if using trigger_word in the configuration; it will be automatically replaced.
- Image Processing: Images are never upscaled but will be downscaled as needed and placed in appropriate buckets for batching.
- Image Cropping/Resizing: Manual cropping or resizing of images is not required. The loader will automatically resize images and can handle varying aspect ratios.
In the end, your data folder should contain, for each image, a .jpg file and a corresponding .txt file with its caption.
I browsed the internet for 40 pictures and asked ChatGPT to write the captions and generate the .txt files for me to download.
You can find the full dataset here.
As you can see, I’m going for that vintage look & feel here.
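Before training, it's worth checking that every image has a caption file and that no unsupported formats (like WebP) slipped in. This little script is my own addition, not part of ai-toolkit, and assumes your images live in data/:

```python
from pathlib import Path

DATA_DIR = Path("data")
SUPPORTED = {".jpg", ".jpeg", ".png"}

for item in sorted(DATA_DIR.iterdir()):
    if item.is_dir() or item.suffix.lower() == ".txt":
        continue
    if item.suffix.lower() not in SUPPORTED:
        print(f"unsupported format (convert it): {item.name}")
        continue
    caption = item.with_suffix(".txt")
    if not caption.exists():
        print(f"missing caption: {caption.name}")
    elif not caption.read_text(encoding="utf-8").strip():
        print(f"empty caption: {caption.name}")
print("dataset check complete")
```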
Configuring FLUX fine tuning process
First, let's copy the example config file located at config/examples/train_lora_flux_24gb.yaml to the config folder, rename it to flux_vintage_aesthetics.yaml, and make the following edits:
```yaml
---
job: extension
config:
  # this name will be the folder and filename name
  name: "flux_vintage_aesthetics"
  process:
    - type: 'sd_trainer'
      # root folder to save training sessions/samples/weights
      training_folder: "output/vintageae"
      # uncomment to see performance stats in the terminal every N steps
      performance_log_every: 1000
      device: cuda:0
      # if a trigger word is specified, it will be added to captions of training data if it does not already exist
      # alternatively, in your captions you can add [trigger] and it will be replaced with the trigger word
      trigger_word: "v1nt4g3"
      network:
        type: "lora"
        linear: 32
        linear_alpha: 32
      save:
        dtype: float16 # precision to save
        save_every: 250 # save every this many steps
        max_step_saves_to_keep: 4 # how many intermittent saves to keep
      datasets:
        # datasets are a folder of images. captions need to be txt files with the same name as the image
        # for instance image2.jpg and image2.txt. Only jpg, jpeg, and png are supported currently
        # images will automatically be resized and bucketed into the resolution specified
        # on windows, escape back slashes with another backslash so
        # "C:\\path\\to\\images\\folder"
        - folder_path: "data"
          caption_ext: "txt"
          caption_dropout_rate: 0.05 # will drop out the caption 5% of time
          shuffle_tokens: false # shuffle caption order, split by commas
          cache_latents_to_disk: true # leave this true unless you know what you're doing
          resolution: [ 512, 768, 1024 ] # flux enjoys multiple resolutions
      train:
        batch_size: 1
        steps: 4000 # total number of steps to train 500 - 4000 is a good range
        gradient_accumulation_steps: 1
        train_unet: true
        train_text_encoder: false # probably won't work with flux
        gradient_checkpointing: true # need this on unless you have a ton of vram
        noise_scheduler: "flowmatch" # for training only
        optimizer: "adamw8bit"
        lr: 1e-4
        # uncomment this to skip the pre training sample
        skip_first_sample: true
        # uncomment to completely disable sampling
        # disable_sampling: true
        # uncomment to use new bell curved weighting. Experimental but may produce better results
        linear_timesteps: true
        # ema will smooth out learning, but could slow it down. Recommended to leave on.
        ema_config:
          use_ema: true
          ema_decay: 0.99
        # will probably need this if gpu supports it for flux, other dtypes may not work correctly
        dtype: bf16
      model:
        # huggingface model name or path
        name_or_path: "black-forest-labs/FLUX.1-dev"
        is_flux: true
        quantize: true # run 8bit mixed precision
        # low_vram: true # uncomment this if the GPU is connected to your monitors. It will use less vram to quantize, but is slower.
      sample:
        sampler: "flowmatch" # must match train.noise_scheduler
        sample_every: 250 # sample every this many steps
        width: 1024
        height: 1024
        prompts:
          # you can add [trigger] to the prompts here and it will be replaced with the trigger word
          - "[trigger] holding a sign that says 'I LOVE VINTAGE!'"
          - "[trigger] with red hair, playing chess at the park, next to a vintage white car on a tree-lined street"
          - "[trigger] holding a coffee cup, in a beanie, sitting at a cafe"
          - "[trigger] exudes elegance in a light beige suit and oversized hat, as a DJ at a night club, fish eye lens, smoke machine, lazer lights, holding a martini"
          - "[trigger] showing off his cool new t shirt at the beach, a shark is jumping out of the water in the background"
          - "A couple walks hand in hand through a charming countryside setting in the snow covered mountains"
          - "[trigger] wearing short-sleeved white shirt tucked into high-waisted black trousers, playing the guitar on stage, singing a song"
          - "hipster man with a beard, dressed in chic and retro-inspired outfits, building a chair, in a wood shop"
          - "A stylish couple walks hand in hand outside a luxurious building. The woman wears a black dress and wide-brimmed hat, while the man is dressed in a tailored brown blazer and dark trousers."
          - "A woman enjoys a sunny day by the river, leaning back with her face to the sun. She is dressed in a loose blue shirt and white shorts, with a woven bag by her side, capturing a carefree moment."
          - "[trigger] stands confidently on a boat, holding onto a beam as the wind blows through its hair. He is dressed in a blue and white striped shirt paired with white trousers, exuding a nautical style."
        neg: "" # not used on flux
        seed: 42
        walk_seed: true
        guidance_scale: 4
        sample_steps: 20
# you can add any additional meta info here. [name] is replaced with config name at top
meta:
  name: "[vintageae]"
  version: '1.0'
```
Let’s briefly have a look at what’s going on in the configuration file:
- The job is set to 'extension' since the task involves extending or fine-tuning an existing model.
- config.name is 'flux_vintage_aesthetics,' which defines the folder and filename conventions for saving outputs.
- The process type is defined as sd_trainer, the toolkit's diffusion trainer process (the name comes from Stable Diffusion, but it handles FLUX via the is_flux flag).
- The training_folder is set to 'output/vintageae,' where all training outputs, including models, logs, and samples, will be saved.
- Performance logging is configured to log every 1000 steps, allowing you to monitor training progress in the terminal.
- The training is set to use the GPU device cuda:0. This should match the specific hardware configuration of your machine.
- A trigger_word 'v1nt4g3' is specified, which will be automatically added to captions during training if it's not already present. This ensures consistency in the training data.
- The network configuration specifies a "lora" network with linear (the LoRA rank) and linear_alpha both set to 32. These parameters control the size of the injected low-rank layers and therefore the adapter's capacity; see the quick sketch after this list.
- The model weights will be saved with dtype set to float16, which balances precision and memory efficiency. The model will be saved every 250 steps, and up to 4 intermediate saves will be kept.
- The datasets section defines the input data. Images are stored in the folder_path: 'data' directory, with corresponding captions in .txt files.
- The caption_dropout_rate is set to 0.05, meaning captions will be randomly dropped 5% of the time during training to introduce variation.
- shuffle_tokens is set to false, meaning the order of tokens in captions will not be shuffled.
- cache_latents_to_disk is enabled to improve performance.
- The images will be automatically resized to one of the resolutions specified in the resolution list: 512, 768, or 1024 pixels. This supports multi-resolution training, which can improve the model's generalization.
- Training settings include a batch_size of 1 and a total of 4000 training steps, which is within a good range for fine-tuning. gradient_accumulation_steps is set to 1, which affects how gradients are accumulated before updating the model weights.
- train_unet is set to true, enabling training of the image backbone (the config keeps the U-Net naming even though FLUX uses a transformer). train_text_encoder is false, meaning the text encoders won't be trained; as the config comment notes, that probably wouldn't work with FLUX anyway.
- gradient_checkpointing is enabled, which is necessary for training large models on GPUs with limited VRAM.
- The noise scheduler for training is set to “flowmatch,” and the optimizer is adamw8bit, which is optimized for 8-bit mixed precision training, improving memory efficiency.
- Learning rate (lr) is set to 1e-4, a common starting point for fine-tuning.
- EMA (Exponential Moving Average) is configured with use_ema: true and ema_decay: 0.99. EMA helps smooth out the learning process, though it can slow down convergence slightly.
- Model configuration includes the name_or_path pointing to 'black-forest-labs/FLUX.1-dev,' indicating the base model being fine-tuned. is_flux is true, confirming the model is a flux variant.
- quantize is true, enabling 8-bit mixed precision to reduce memory usage. There is an option to enable low_vram if the GPU is connected to monitors, which reduces VRAM usage but slows down training.
- The sample section specifies the parameters for generating sample outputs during training. The sampler is set to 'flowmatch,' matching the noise scheduler used in training. Samples will be generated every 250 steps with a width and height of 1024 pixels each.
- Prompts for generating samples are provided, with some containing the [trigger] placeholder, which will be replaced by the trigger word during sampling. These prompts guide the model in generating images with the desired vintage aesthetic.
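To build a bit of intuition for the linear / linear_alpha values, here's a rough back-of-the-envelope sketch (my own, not toolkit code); the 3072-wide layer is a hypothetical example in the ballpark of FLUX's hidden width:

```python
# A LoRA adapter adds two low-rank matrices A (d_in x r) and B (r x d_out) per
# targeted linear layer, and the update is scaled by alpha / r.
def lora_extra_params(d_in: int, d_out: int, rank: int = 32) -> int:
    return rank * (d_in + d_out)

print(lora_extra_params(3072, 3072))  # 196,608 extra trainable params for one layer
print(32 / 32)                        # effective LoRA scale with linear=32, linear_alpha=32
```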
You can also play around with some of these parameters to see how they affect learning.
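For instance, here's a hypothetical helper for spinning off a variant config with a different learning rate so you can compare runs; note that yaml.safe_dump drops the comments, and the keys follow the config above:

```python
import copy
import yaml

with open("config/flux_vintage_aesthetics.yaml") as f:
    cfg = yaml.safe_load(f)

variant = copy.deepcopy(cfg)
variant["config"]["name"] = "flux_vintage_aesthetics_lr5e5"
variant["config"]["process"][0]["train"]["lr"] = 5e-5

with open("config/flux_vintage_aesthetics_lr5e5.yaml", "w") as f:
    yaml.safe_dump(variant, f, sort_keys=False)
```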
Now, we are ready to run the fine tuning.
A great thing to keep in mind: you can stop the training at any time with Ctrl+C, and when you resume, it will pick back up from the last checkpoint.
Let’s run it!
```bash
python3 run.py config/flux_vintage_aesthetics.yaml
```
After a few sampling rounds (one every 250 steps), you will see the output folder populate with samples, so you can inspect how FLUX starts to pick up the style we want it to learn.
OK, let's look at the results:
The transformation shown in the images reflects how effective the fine-tuning process is, especially considering that I only used 40 images.
Before we dig into them, I want to quickly add:
What did FLUX learn during LoRA fine tuning?
FLUX has successfully picked up on the styles, color schemes, and grading we were aiming for.
Here's what stands out:
- Style Adaptation: There is a consistent vintage aesthetic. The model has clearly learned the visual cues and characteristics, such as color tones, lighting, and overall mood.
- Color and Grading: There is a shift towards warmer, more muted tones that are typical of vintage photography, which aligns well with fine-tuning goals.
- Content Interpretation: The model also appears to have effectively translated the content and themes from the original images while applying the new style, whether it’s a person holding a sign, enjoying a coffee, or exuding style in a nightclub setting.
Overall, the images show that the fine-tuning process has been highly successful in achieving a specific, cohesive look and feel.
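If you want to generate with the trained LoRA outside of ai-toolkit's built-in sampler, a sketch with a recent diffusers version looks roughly like this; the exact output path and .safetensors file name depend on your run, so check your training_folder first:

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
# Adjust the path/file name to whatever ai-toolkit saved for your run
pipe.load_lora_weights(
    "output/vintageae/flux_vintage_aesthetics",
    weight_name="flux_vintage_aesthetics.safetensors",
)
pipe.enable_model_cpu_offload()  # helps squeeze inference onto a 24GB card

image = pipe(
    "v1nt4g3 holding a coffee cup, in a beanie, sitting at a cafe",
    num_inference_steps=20,
    guidance_scale=4.0,
).images[0]
image.save("vintage_sample.png")
```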
If you use FLUX or other text-to-image models such as Midjourney, let me know in the comments!
Bonus: Building with LLMs
And don’t forget to have a look at some practitioner resources that we published recently:
Thank you for stopping by, and being an integral part of our community.
Happy building!