
Fine Tuning FLUX: Personalize AI Image Models on Minimal Data for Custom Look and Feel

2024-08-15

Black Forest Labs (founded by Stability AI alumni) launched FLUX.1, a suite of AI image generation models with open-weight variants that you can run locally.


FLUX models took social media by storm, because the largest one, FLUX.1 [pro], outperformed Stable Diffusion 3 Ultra, Midjourney v6.0, and DALL·E 3 HD.


Look at the benchmarks, absolutely crazy!



There are 3 models:

FLUX.1 [pro], the flagship model, available through the API
FLUX.1 [dev], an open-weight, guidance-distilled model for non-commercial use
FLUX.1 [schnell], an open-weight, step-distilled model built for speed and released under Apache 2.0
All of them are based on a hybrid architecture of multimodal and parallel diffusion transformer blocks, scaled to 12 billion parameters.



New LoRAs and extensions are being released as you read this article, people are saying it's noticeably better than Midjourney, and my own experience has been very, very promising!


FLUX models democratize access to cutting-edge generative AI research, and push the limits of text-to-image synthesis.


A fun fact: Black Forest Labs CEO Robin Rombach also co-authored key papers on VQGAN, latent diffusion, adversarial diffusion distillation, Stable Diffusion XL, and Stable Video Diffusion.


We will see a wave of open models built on FLUX, which is why I wanted to walk you through fine tuning FLUX with LoRA locally.


Let’s GOOOOO!



Setting up the environment for Flux LoRA fine tuning


First things first, here's what you should already have:

An NVIDIA GPU with at least 24 GB of VRAM (we will use the 24 GB example config)
Python 3.10 or newer and git installed
A Hugging Face account (we will create an access token in a moment)
Once everything is in place, open your command line interface — this could be your Command Prompt, Terminal, or any other CLI tool you’re comfortable with — and run the following commands:


git clone https://github.com/ostris/ai-toolkit.git  
cd ai-toolkit  
git submodule update --init --recursive  

python3 -m venv flux-finetune-env  
source flux-finetune-env/bin/activate  

pip3 install torch  
pip3 install -r requirements.txt

You should now be looking at the full project structure, roughly like this:
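(The exact listing depends on the ai-toolkit version, but the pieces we use in this walkthrough are there.)

ai-toolkit/
├── config/
│   └── examples/
│       └── train_lora_flux_24gb.yaml
├── run.py
├── requirements.txt
└── ...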



Now, we will set up Hugging Face access: create an account if you don't have one, accept the FLUX.1-dev license on the model page, and generate a read access token in your account settings. Then add the token to a .env file in the root of the project:


HF_TOKEN=hf_jpTKpr....
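If you want a quick sanity check that the token works and that you have accepted the FLUX.1-dev license, a small optional snippet like this does the job (huggingface_hub comes in through the toolkit's dependencies; the environment-variable lookup assumes you export HF_TOKEN or load the .env yourself):

import os
from huggingface_hub import HfApi

# Use the same token you put in .env (read here from an environment variable)
token = os.environ.get("HF_TOKEN")

api = HfApi(token=token)
print(api.whoami()["name"])                     # confirms the token is valid
api.model_info("black-forest-labs/FLUX.1-dev")  # raises an error if the license hasn't been accepted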


and we are ready to go!


Dataset preparation for fine tuning FLUX models


Here's how we will prepare our dataset, following the guidance shared in the repository: gather the training images into a single folder, write a caption for each image in a .txt file with the same base name, and optionally use [trigger] in the captions where you want the trigger word substituted. The images are resized and bucketed automatically during training, so you don't need to crop them by hand.
In the end, your data folder should look like the following:
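(The file names below are just illustrative; any base names work as long as each image has a matching caption file.)

data/
├── image01.jpg
├── image01.txt
├── image02.jpg
├── image02.txt
└── ...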



For each image, there should be a .jpg file and a corresponding .txt file that contains the caption.
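Before training, it's worth a quick check that every image actually has a non-empty caption. A small throwaway script like this (my own addition, not part of the toolkit) will flag any gaps:

from pathlib import Path

data_dir = Path("data")
images = sorted(p for p in data_dir.iterdir() if p.suffix.lower() in {".jpg", ".jpeg", ".png"})

for img in images:
    caption = img.with_suffix(".txt")
    if not caption.exists():
        print(f"Missing caption for {img.name}")
    elif not caption.read_text().strip():
        print(f"Empty caption for {img.name}")

print(f"Checked {len(images)} image/caption pairs")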



I browsed the internet for the pictures (40 in total) and asked ChatGPT to write captions and generate the .txt files for me to download.


You can find the full dataset here.


As you can see, I’m going for that vintage look & feel here.


Configuring FLUX fine tuning process


First, let's copy the example config file located at config/examples/train_lora_flux_24gb.yaml into the config folder and rename it to flux_vintage_aesthetics.yaml.
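You can do both in one step from the repository root:

cp config/examples/train_lora_flux_24gb.yaml config/flux_vintage_aesthetics.yaml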


Then make the following edits:


---  
job: extension  
config:  
  # this name will be the folder and filename name  
  name: "flux_vintage_aesthetics"  
  process:  
    - type: 'sd_trainer'  
      # root folder to save training sessions/samples/weights  
      training_folder: "output/vintageae"  
      # uncomment to see performance stats in the terminal every N steps  
      performance_log_every: 1000  
      device: cuda:0  
      # if a trigger word is specified, it will be added to captions of training data if it does not already exist  
      # alternatively, in your captions you can add [trigger] and it will be replaced with the trigger word  
      trigger_word: "v1nt4g3"  
      network:  
        type: "lora"  
        linear: 32  
        linear_alpha: 32  
      save:  
        dtype: float16 # precision to save  
        save_every: 250 # save every this many steps  
        max_step_saves_to_keep: 4 # how many intermittent saves to keep  
      datasets:  
        # datasets are a folder of images. captions need to be txt files with the same name as the image  
        # for instance image2.jpg and image2.txt. Only jpg, jpeg, and png are supported currently  
        # images will automatically be resized and bucketed into the resolution specified  
        # on windows, escape back slashes with another backslash so  
        # "C:\\path\\to\\images\\folder"  
        - folder_path: "data"  
          caption_ext: "txt"  
          caption_dropout_rate: 0.05  # will drop out the caption 5% of time  
          shuffle_tokens: false  # shuffle caption order, split by commas  
          cache_latents_to_disk: true  # leave this true unless you know what you're doing  
          resolution: [ 512, 768, 1024 ]  # flux enjoys multiple resolutions  
      train:  
        batch_size: 1  
        steps: 4000  # total number of steps to train 500 - 4000 is a good range  
        gradient_accumulation_steps: 1  
        train_unet: true  
        train_text_encoder: false  # probably won't work with flux  
        gradient_checkpointing: true  # need this on unless you have a ton of vram  
        noise_scheduler: "flowmatch" # for training only  
        optimizer: "adamw8bit"  
        lr: 1e-4  
        # uncomment this to skip the pre training sample  
        skip_first_sample: true  
        # uncomment to completely disable sampling  
#        disable_sampling: true  
        # uncomment to use new bell curved weighting. Experimental but may produce better results  
        linear_timesteps: true  

        # ema will smooth out learning, but could slow it down. Recommended to leave on.  
        ema_config:  
          use_ema: true  
          ema_decay: 0.99  

        # will probably need this if gpu supports it for flux, other dtypes may not work correctly  
        dtype: bf16  
      model:  
        # huggingface model name or path  
        name_or_path: "black-forest-labs/FLUX.1-dev"  
        is_flux: true  
        quantize: true  # run 8bit mixed precision  
#        low_vram: true  # uncomment this if the GPU is connected to your monitors. It will use less vram to quantize, but is slower.  
      sample:  
        sampler: "flowmatch" # must match train.noise_scheduler  
        sample_every: 250 # sample every this many steps  
        width: 1024  
        height: 1024  
        prompts:  
          # you can add [trigger] to the prompts here and it will be replaced with the trigger word  
          - "[trigger] holding a sign that says 'I LOVE VINTAGE!'"  
          - "[trigger] with red hair, playing chess at the park, next to a vintage white car on a tree-lined street"  
          - "[trigger] holding a coffee cup, in a beanie, sitting at a cafe"  
          - "[trigger] exudes elegance in a light beige suit and oversized hat, as a DJ at a night club, fish eye lens, smoke machine, lazer lights, holding a martini"  
          - "[trigger] showing off his cool new t shirt at the beach, a shark is jumping out of the water in the background"  
          - "A couple walks hand in hand through a charming countryside setting in the snow covered mountains"  
          - "[trigger] wearing short-sleeved white shirt tucked into high-waisted black trousers, playing the guitar on stage, singing a song"  
          - "hipster man with a beard, dressed in chic and retro-inspired outfits, building a chair, in a wood shop"  
          - "A stylish couple walks hand in hand outside a luxurious building. The woman wears a black dress and wide-brimmed hat, while the man is dressed in a tailored brown blazer and dark trousers."  
          - "A woman enjoys a sunny day by the river, leaning back with her face to the sun. She is dressed in a loose blue shirt and white shorts, with a woven bag by her side, capturing a carefree moment."  
          - "[trigger] stands confidently on a boat, holding onto a beam as the wind blows through its hair. He is dressed in a blue and white striped shirt paired with white trousers, exuding a nautical style."  
        neg: ""  # not used on flux  
        seed: 42  
        walk_seed: true  
        guidance_scale: 4  
        sample_steps: 20  
# you can add any additional meta info here. [name] is replaced with config name at top  
meta:  
  name: "[vintageae]"  
  version: '1.0'

Let's briefly have a look at what's going on in the configuration file. The trigger_word "v1nt4g3" is added to every caption (or substituted for [trigger]), so you can invoke the learned style at inference time. The network block defines a LoRA with rank (linear) 32 and alpha 32, and the datasets block points at the data folder, expects .txt captions, drops captions 5% of the time as a regularizer, and buckets images at 512, 768, and 1024 resolutions.

Training runs for 4,000 steps at batch size 1 with the 8-bit AdamW optimizer, a 1e-4 learning rate, bf16, gradient checkpointing, and EMA smoothing. The model block loads black-forest-labs/FLUX.1-dev with 8-bit quantization so it fits in 24 GB of VRAM, checkpoints are saved in float16 every 250 steps (keeping the last 4), and the sample block renders the listed prompts every 250 steps so you can watch the style emerge.
You can also play around with some of these parameters to see how they affect learning.


Now, we are ready to run the fine tuning.


Running the FLUX fine tuning


A great thing to keep in mind is that you can stop the training at any time with Ctrl+C, and when you resume, it will pick back up from the last checkpoint.


Let’s run it!


python3 run.py config/flux_vintage_aesthetics.yaml

As training progresses, you will see the output folder populate with samples, so you can inspect how FLUX is starting to pick up the style we want it to learn.
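Based on the training_folder, save_every, and sample_every settings above, the output directory fills up roughly like this (exact folder nesting and file names vary by ai-toolkit version, so treat this as illustrative):

output/vintageae/
└── flux_vintage_aesthetics/
    ├── samples/                                        # preview renders of the sample prompts, every 250 steps
    ├── flux_vintage_aesthetics_000000250.safetensors   # LoRA checkpoints, every 250 steps (last 4 kept)
    ├── flux_vintage_aesthetics_000000500.safetensors
    └── ...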


OK, let's look at the results:



The transformation shown in the images reflects the effectiveness of the fine-tuning process, especially given that I only used 40 images.


Before we dig into what the model learned, I want to quickly add:


If you want to build full-stack GenAI SaaS products that people love, don't miss our upcoming cohort-based course. Together, we'll build, ship, and scale your GenAI product alongside a community of like-minded people!


What has FLUX learned during LoRA fine tuning?


FLUX has successfully picked up on the styles, color schemes, and color grading we were aiming for.


Here's what also stands out about the results:



Overall, the images show that the fine-tuning process has been highly successful in achieving a specific, cohesive look and feel.
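If you want to try the learned style yourself, here's a minimal inference sketch using diffusers' FluxPipeline together with the trigger word from the config. The LoRA path and file name below are assumptions; point load_lora_weights at whatever checkpoint ai-toolkit actually wrote to your output folder:

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
# Path and weight_name are hypothetical; use the checkpoint from your own training run
pipe.load_lora_weights("output/vintageae/flux_vintage_aesthetics",
                       weight_name="flux_vintage_aesthetics.safetensors")
pipe.enable_model_cpu_offload()  # helps fit FLUX.1-dev on a 24 GB card

# Prompt, resolution, guidance, steps, and seed mirror the sample block in the training config
image = pipe(
    "v1nt4g3 holding a coffee cup, in a beanie, sitting at a cafe",
    height=1024,
    width=1024,
    guidance_scale=4.0,
    num_inference_steps=20,
    generator=torch.Generator("cpu").manual_seed(42),
).images[0]
image.save("vintage_sample.png")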


If you use FLUX or other text-to-image models such as Midjourney, let me know in the comments!


Bonus: Building with LLMs


And don’t forget to have a look at some practitioner resources that we published recently:


Say Hello to ‘Her’: Real-Time AI Voice Agents with 500ms Latency, Now Open Source

Fine-Tune Meta’s Latest AI Model: Customize Llama 3.1 5x Faster with 80% Less Memory

Fine Tuning FLUX: Personalize AI Image Models on Minimal Data for Custom Look and Feel

Data Management with Drizzle ORM, Supabase and Next.js for Web & Mobile Applications


Thank you for stopping by, and being an integral part of our community.


Happy building!