How does an image-generating neural network work? Explained in simple words in one school lesson
In this express lesson we will take a slightly unusual route through the material, but trust me: it has been tested, and it works better this way. If you read everything below carefully, you will fit into a "school" 45 minutes (that is how long a regular lesson lasts) and get all the basic knowledge and an understanding of the main topics and the relationships between them.
Everything described here relates to image generation, and if you stick with it, in 45 minutes you will have a good grasp of the subject, without even touching the technical details of how all these genuinely complicated things actually work.
How does the diffusion model work?
A diffusion model is a method that “learns” to turn random noise into meaningful images (or other data) through gradual “cleaning.” Imagine taking a beautiful picture, adding more and more noise to it until it’s just a gray mess, and then learning to remove that noise step by step to restore the picture. That’s roughly how diffusion works.
How it happens step by step:
Forward process (adding noise): First, the model takes real images (say, from a huge dataset) and gradually "destroys" them by adding noise. This is done over many steps, say 1000, until the image is pure random noise. The model keeps track of how the noise is added and how the image is "destroyed."
Reverse process (learning to denoise): Now the model learns to reverse the process: take that noise and remove it step by step to restore the original image. At each step it looks at the noisy image and predicts how to remove some of the noise to get closer to the original. It's like solving a puzzle by removing the unnecessary pieces.
Generation from noise to image: Once the model is trained, you start with pure noise (just random numbers) and ask it to clean that noise up. It does this gradually, step by step (e.g. 50-1000 steps), until a meaningful image emerges. This is where CLIP comes in: it tells the model what exactly this image should be (e.g. "a cat in a hat"), guiding the cleaning process.
Working in latent space: in modern systems like Stable Diffusion, this process does not happen directly on pixels, but in a compressed (latent) form, which is created and decoded by the VAE. This speeds up the work and makes it more efficient.
A simple analogy:
Imagine a sculptor starting with a lump of clay that looks like a shapeless mass (noise). He gradually gives it shape, removing excess and adding details until he has a statue. The diffusion model does the same thing, only with noise instead of clay, and "sculpts" the image based on a text prompt from CLIP.
Why does this work?
The good thing about diffusion models is that they improve the image step by step, rather than trying to create a perfect picture out of nothing. This makes the result better and more detailed than other methods.
Step by step:
- The diffusion model takes noise and gradually turns it into an image.
- CLIP sets the direction ("what to draw").
- VAE helps to translate the result from compressed form to the final image.
All together it’s like a well-coordinated team: CLIP is the director, diffusion is the artist, and VAE is the one who shows the finished work to the audience.
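If you are curious what this "team" looks like in practice, here is a minimal sketch using the Hugging Face diffusers library (the checkpoint name and settings are illustrative assumptions, not the only option): one pipeline object bundles the CLIP text encoder, the diffusion U-Net and the VAE decoder, and a single call runs the whole noise-to-image process.

```python
# A minimal sketch, assuming the diffusers library and an example
# Stable Diffusion checkpoint; not the only way to wire this up.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # example checkpoint
    torch_dtype=torch.float16,
).to("cuda")

# CLIP turns the prompt into an embedding, the diffusion model "cleans" noise
# step by step under that guidance, and the VAE decodes the result into pixels.
image = pipe("a cat in a hat", num_inference_steps=50).images[0]
image.save("cat_in_a_hat.png")
```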
CLIP
CLIP is a model created by OpenAI that helps neural networks "understand" what is in a picture or a text and connect the two. When a neural network (like DALL·E or Stable Diffusion) generates an image, CLIP acts as a "translator" between the text you input (like "a cat in a hat") and what the network should draw.
Here’s how it works at a simple level:
- You give a description : For example, "a dog is playing with a ball on the beach."
- CLIP analyzes the text : It “disassembles” your words and turns them into a kind of numerical code (embedding) that reflects the meaning of the description.
- Link to image : This code is passed to the generative neural network. It uses it as an "instruction" and starts drawing an image that must match this description.
- Checking the result : CLIP looks at the resulting image and compares it to your text query. If the image is not quite right (for example, there is a cat instead of a dog), the network adjusts its work to better "please" CLIP.
Essentially, CLIP is like the "eyes and ears" of the system: it helps the neural network understand what you want and ensures that the picture is as close as possible to your request. Without it, the network would draw something random, not understanding what is required of it.
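As a rough illustration, here is how CLIP's "checking" role can be reproduced with the transformers library: it embeds an image and several text descriptions into the same space and scores how well they match. The checkpoint and file name below are example assumptions.

```python
# A sketch, assuming the transformers library, an example CLIP checkpoint
# and a hypothetical image file on disk.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("generated.png")   # hypothetical generated image
texts = ["a dog playing with a ball on the beach", "a cat in a hat"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Higher probability = the image sits closer to that description in CLIP's space.
print(outputs.logits_per_image.softmax(dim=-1))
```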
VAE
VAE is another important "helper" in neural networks, especially in systems like Stable Diffusion. While CLIP is responsible for linking text and images, VAE helps the neural network itself "pack" and "unpack" images to make them easier and faster to work with.
Here’s how it works:
- Image compression : VAE takes an image (or whatever the network is trying to create) and "compresses" it into a compact set of numbers, called a latent representation. Think of it as a ZIP archive for an image: everything important is preserved, but it takes up less space.
- Working in "compressed form" : A generative network (like the one that draws a picture) works not with a huge number of pixels directly, but with this compressed code. This speeds up the process a lot, because it’s easier to process a small set of numbers than millions of pixels.
- Unpacking into an image : Once the network has finished "drawing" in this compressed form (given the instructions from CLIP), VAE takes this code and "unpacks" it back into a full image with colors, details and all.
- Adding Variations : Another cool thing about VAE is that it can add randomness. This allows the network to generate different versions of the same idea (like slightly different poses for a dog with a ball) rather than the same image every time.
In simple terms, VAE is like a “packer” and “unpacker” that helps the neural network work efficiently and turn abstract ideas into beautiful pictures. Without it, the process would be slow and complicated, and the result might look worse.
In short: CLIP tells you "what to draw", and VAE helps you "how to draw and show it". Together they do the magic of image generation!
VAE does not transform the latent representation before generation, but rather participates in the process "from both sides". Here’s how it works step by step:
- Start with noise : Generation does not start with a finished image, but with random "noise" - this is just a bunch of random numbers that looks like static noise on an old TV. This noise is already in a compressed (latent) format, because working with it directly as pixels would be too difficult.
- The role of CLIP : You enter a text query (e.g. "cat in the hat"), and CLIP turns it into a numeric code (embedding) that specifies the direction for generation.
- Latent space generation : A special part of the network (usually a diffusion model) takes this noise in latent format and gradually "cleans" it into a meaningful image. It does this based on instructions from CLIP. All this happens in a compressed form in that same latent space.
- VAE at the output : When the diffusion model has finished its work and received the final latent representation (the compressed code that already “looks” like the cat in the hat), VAE takes this code and “unpacks” it into a full-fledged picture with pixels, colors and details.
That is, VAE does not compress anything before generation. At the start, we already have compressed noise, and generation occurs in this compressed (latent) format. VAE is needed mainly at the final stage - to turn the result of the network’s work into a beautiful picture that you see. And VAE also helps initially train the network, showing how to "compress" real images into latent space, so that the network understands what to work with.
To put it simply: the generation happens in a compressed form (in latent space), and the VAE then "unfolds" it into a full-fledged image. The magic of "drawing" itself is the work of the diffusion model, not the VAE.
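To make the "pack / unpack" idea concrete, here is a hedged sketch using the AutoencoderKL class from diffusers; the checkpoint, the input file and the 0.18215 scaling factor are assumptions borrowed from the classic Stable Diffusion setup.

```python
# A sketch, assuming diffusers, an example VAE checkpoint and a local photo.
import numpy as np
import torch
from diffusers import AutoencoderKL
from diffusers.utils import load_image

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

img = load_image("photo.png").resize((512, 512))            # hypothetical input
x = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0   # scale pixels to [-1, 1]
x = x.permute(2, 0, 1).unsqueeze(0)                          # (1, 3, 512, 512)

with torch.no_grad():
    # "Pack": 512x512x3 pixels become a much smaller 4x64x64 latent code.
    latents = vae.encode(x).latent_dist.sample() * 0.18215
    # "Unpack": decode the latent code back into pixels.
    recon = vae.decode(latents / 0.18215).sample

print(latents.shape, recon.shape)   # (1, 4, 64, 64) and (1, 3, 512, 512)
```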
Refiner
A “refiner” in the context of generative neural networks like Stable Diffusion is an additional model that is used to improve the quality of already generated images. Let’s break it down in simple terms, what it is and why it’s needed.
What is Refiner?
Refiner is the second stage in the image generation process, which takes the raw output from the base model and refines it. The base model (for example, Stable Diffusion XL Base) creates an initial image from noise, guided by a text query (using CLIP and a diffusion model). But this result may not be very clear, with insufficient detail or minor errors. This is where Refiner comes in: it “polishes” the image, adding details, improving quality and removing excess noise.
In systems like Stable Diffusion XL (SDXL), the process is often divided into two stages:
- Base Model : Generates the base image in compressed (latent) form.
- Refiner : Takes this intermediate result and brings it to its final form, making the image sharper and more realistic.
How does Refiner work?
Refiner is also a diffusion model, but it is specially trained on high-quality data and works with less noise. It takes the latent representation (compressed code) from the base model and continues the process of "cleaning" the noise, but with an emphasis on fine details and high resolution. Sometimes a technique like SDEdit (img2img) is used to further improve the result.
In simple terms, if the base model draws a rough sketch, then the Refiner is like an artist who takes a brush and adds fine lines, shadows and textures.
Why is it needed?
- Improved quality : Without Refiner, images can look blurry, lacking detail, or with strange artifacts. Refiner makes them sharper and more professional.
- Time saving : The base model quickly creates the foundation, and the Refiner spends additional effort only on refinement, which keeps the process efficient.
- Flexibility : You can use Refiner separately, for example, to improve ready-made images via img2img, if you need to correct something.
- Specialization : Refiner is often tailored to specific tasks, such as increasing resolution or adding realism, making it a useful addition.
Example:
Let’s say you asked for "a cat in a hat." The base model gives you something that looks like a cat in a hat, but with fuzzy edges and washed-out details. Refiner looks at that, sharpens the lines, makes the cat’s fur fluffy, and gives the hat crisp folds. The end result is a more beautiful, detailed image.
Is it always needed?
Not necessarily. In some cases, the base model itself gives a good result, especially if the request is simple or you don’t need super details. But if you want top quality, especially for complex scenes or high resolution, Refiner is what takes the picture to the next level.
In short, Refiner is like a filter in a photo editor, only smart and automatic. It is needed to make your pictures look better and more professional.
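Here is a hedged sketch of the two-stage flow in diffusers, where the base model hands its latents to the Refiner; the checkpoints and the 0.8 hand-over point are illustrative choices, not fixed rules.

```python
# A sketch, assuming diffusers and the public SDXL base / refiner checkpoints.
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "a cat in a hat, studio photo"

# Stage 1: the base model does roughly the first 80% of the denoising steps
# and hands over intermediate latents instead of a finished picture.
latents = base(prompt, denoising_end=0.8, output_type="latent").images

# Stage 2: the refiner finishes the last ~20%, adding fine detail and texture.
image = refiner(prompt, image=latents, denoising_start=0.8).images[0]
image.save("cat_refined.png")
```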
The cfg_scale (guidance_scale) parameter, sampler and scheduler components
These parameters are important settings in generative models that affect how the neural network creates images. Let’s take them one by one.
1. CFG Scale (Classifier-Free Guidance Scale)
What is it? It’s a parameter that controls how much the model listens to your text prompt. CFG Scale is responsible for the balance between "creativity" and "accuracy".
How does it work?
- Low CFG (eg 1-5) : The model is more "imaginative" and may deviate from your description, producing something unexpected, but sometimes more creative.
- High CFG (e.g. 10 – 20) : The model strictly follows the request, trying to embody as accurately as possible what you wrote.
Example: If you write "cat in a hat" with CFG = 3, the cat may end up in a strange hat or without one at all, but with CFG = 15, it will clearly be a cat in a hat, as you wanted.
Why is it needed? To control how well the image matches your text.
2. Sampler
What is it? It’s an algorithm that determines how the model "cleans" noise step by step to turn it into an image. In diffusion models, the generation process is a gradual removal of noise, and Sampler decides how to do it.
Popular options:
- DDIM (Denoising Diffusion Implicit Models) : Fast and high quality, but sometimes less detailed.
- Euler (or Euler a) : Simple and fast, gives consistent results.
- DPM++ (DPM++ 2M Karras) : More accurate and detailed, but may be slower.
- UniPC : New and optimized, often gives good quality in fewer steps.
How does it work? Different Samplers "step" from noise to image in different ways. Some do it faster, others - more precisely, others - with a special style.
Example: With DDIM the image can be produced in 20 steps, but with minor flaws, while with DPM++ in the same 20 steps it will be clearer.
Why do you need it? To choose speed or quality depending on your tasks.
3. Guidance Scale
What is it? This is another name for CFG Scale (some systems separate them, but most often they are the same). If there is a difference, Guidance Scale may refer to the strength of the text’s influence at each generation step.
How does it work? Similar to CFG: the higher the value, the more the model is “tied” to the query. Sometimes Guidance Scale is used as an additional parameter for fine-tuning.
Example: If CFG = 10 and Guidance Scale = 5, the model may slightly "soften" the strictness of following the text at certain stages.
Why is it needed? For more flexible control over the process (but usually it is just a synonym for CFG).
4. Scheduler
What is it? It’s a "schedule" that controls how fast or slow the model removes noise at different generation steps. The Scheduler works with the Sampler to determine how much noise to remove at each step.
Types:
- Linear : Noise is removed uniformly at all steps.
- Cosine : More attention to the beginning and end of the process to improve details.
- Karras : Optimized for quality, noise is removed unevenly, with an emphasis on important stages.
How does it work? Scheduler decides whether the model will "rush" at the beginning or end of generation. For example, Cosine makes the first steps smoother and the last ones more precise.
Example: With Linear Scheduler the image may look rougher, while with Karras it may look more polished.
Why is it needed? To adjust the balance between speed and quality, and to influence the style of the result.
How is all this connected?
CFG Scale / Guidance Scale: Controls how much the image "listens" to the text.
Sampler: Determines how exactly noise is converted into an image.
Scheduler: Controls in what order and at what speed this happens.
Simple example:
You want the "cat in the hat":
- CFG Scale = 12 : The model clearly draws the cat in the hat.
- Sampler = Euler a : Quickly generates a picture in 30 steps.
- Scheduler = Karras : Makes the process smooth, with an emphasis on detail.
Result: A crisp cat in the hat with good detail.
If you set CFG to 5, select DDIM for Sampler, and Linear for Scheduler, you’ll get something more abstract and rough, but faster.
How to use?
Experiment: Try different combinations to see what you like.
Typical values:
- CFG Scale : 7 – 15 (golden mean).
- Sampler : Euler a or DPM++ (popular choices).
- Scheduler : Karras or Cosine (for quality).
Balance: A high CFG with a fast Sampler and a simple Scheduler can give clarity without wasting time.
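In code, all three knobs sit in one place. Below is a hedged sketch of where they appear in a diffusers pipeline call; the checkpoint, scheduler class and exact values are example assumptions.

```python
# A sketch, assuming diffusers and an example Stable Diffusion checkpoint.
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Sampler / scheduler choice: swap in DPM++ with a Karras noise schedule.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True
)

image = pipe(
    "a cat in a hat",
    guidance_scale=12,        # CFG: how strictly to follow the prompt
    num_inference_steps=30,   # how many denoising steps the sampler takes
).images[0]
image.save("cat_cfg12.png")
```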
How the diffusion model works internally
A diffusion model is a type of neural network that works on the idea of gradually “cleaning up” noise. Its main task is to learn how to turn random noise into meaningful data (like images). Internally, it consists of several key components and processes. Here’s how it works:
1. Basic idea: direct and reverse process
The diffusion model works in two directions:
- Forward Diffusion: Takes real data (like a picture) and gradually adds noise to it until it becomes pure random noise. It’s like taking a photograph and blurring it into a gray mess.
- Reverse Diffusion: Learns to recreate the original data from this noise by removing the noise step by step. It’s like taking the noise and "painting" the picture back.
The model doesn’t just "spoil" and "fix" images - it learns to predict how to remove noise so that it can ultimately generate new images from scratch.
2. Architecture inside: U-Net
The main "worker" inside the diffusion model is a neural network called U-Net. It looks like the letter "U" because the data is first compressed and then expanded. Here’s what it does:
- Encoder : Takes a noisy image (or latent representation if using VAE) and analyzes it, extracting important features of shape, texture, color. It’s like looking at a picture and remembering the key details.
- U-Net bottleneck (the "bottom" of the U) : Here the data is maximally compressed, and the network decides how to "fix" it by removing noise.
- Decoder : Gradually restores the image, adding details back, but with less noise.
The good thing about U-Net is that it preserves fine detail information (thanks to the "bridges" between compression and expansion), which is important for generating clear images.
3. Steps and noise
Diffusion works with a fixed number of steps (e.g. 1000). Each step is a noise level:
In the forward process, the model knows how to add a little more noise at each step (this is given mathematically by a distribution, usually normal, like Gaussian noise).
In the reverse process, it learns to predict how to remove this noise. The input is a noisy image and a step number (timestep), so that the network understands how much noise there is already and how much needs to be removed.
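A toy version of this forward (noising) step is easy to write down. The sketch below mixes an image with Gaussian noise according to the timestep; the cosine-style schedule is a simplified stand-in for the schedules real models use.

```python
# A toy sketch of forward diffusion with a simplified noise schedule.
import numpy as np

rng = np.random.default_rng(0)

def add_noise(image, t, num_steps=1000):
    """Noisy version of `image` at step t (0 = clean, num_steps = pure noise)."""
    noise = rng.standard_normal(image.shape)
    # alpha_bar says how much of the original image survives at step t.
    alpha_bar = np.cos((t / num_steps) * np.pi / 2) ** 2
    return np.sqrt(alpha_bar) * image + np.sqrt(1 - alpha_bar) * noise

clean = np.zeros((64, 64, 3))              # stand-in for a real picture
slightly_noisy = add_noise(clean, t=100)
mostly_noise = add_noise(clean, t=900)
print(slightly_noisy.std(), mostly_noise.std())   # the noise grows with t
```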
4. Training: how does the model learn?
The model takes real images from the dataset. It adds noise to them at a random step (for example, step 500 out of 1000). U-Net looks at the noisy image and tries to predict what the previous step would look like (for example, step 499 with slightly less noise). The error between the prediction and reality is used to “teach” the network to remove noise better. This is repeated millions of times on different images and steps. Eventually, the model becomes an “expert” at cleaning up noise and can start with pure noise (step 1000) and get to the image (step 0).
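A heavily simplified training step might look like the sketch below: noise an image at a random timestep, ask the network to predict that noise, and penalize the error. The toy schedule and the dummy stand-in for the U-Net are assumptions made purely for illustration.

```python
# A sketch of one training step; `model` is a placeholder for a real U-Net.
import math
import torch
import torch.nn.functional as F

def training_step(model, images, num_steps=1000):
    noise = torch.randn_like(images)
    t = torch.randint(0, num_steps, (images.shape[0],))           # random step per image
    alpha_bar = torch.cos((t / num_steps) * math.pi / 2) ** 2      # toy schedule
    a = alpha_bar.view(-1, 1, 1, 1)
    noisy = a.sqrt() * images + (1 - a).sqrt() * noise             # forward noising
    predicted_noise = model(noisy, t)                              # the network guesses the noise
    return F.mse_loss(predicted_noise, noise)                      # how wrong was the guess?

# Dummy "U-Net" so the sketch runs end to end; a real one is a conditioned U-Net.
dummy_model = lambda x, t: torch.zeros_like(x)
print(training_step(dummy_model, torch.randn(4, 3, 64, 64)))
```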
5. Integration with text (CLIP)
In modern models like Stable Diffusion, diffusion doesn’t just generate random images — it’s text-based. This is where CLIP comes in:
CLIP turns your query (the "cat in the hat") into a numeric code (embedding). This code is fed to U-Net via an "attention" mechanism so that the network knows which features to emphasize (e.g., "hat" or "cat") while it removes noise.
6. Latent space (with VAE)
If VAE is used (as in Stable Diffusion):
The images are first compressed into a latent representation (compact code). U-Net works with this compressed code, not with the pixels directly. After the diffusion is complete, VAE "unpacks" the code into a full-fledged image. This speeds up the process and saves resources.
A simple analogy
Imagine that you are a restorer of old photographs:
You have a bunch of bad photos with different levels of "noise" (scratches, spots). You learn to remove these defects by looking at the originals and remember how to restore the details. Then you are given a completely overexposed photo (pure noise), and you draw a new picture on it step by step, based on the description ("a cat in a hat").
U-Net is your “eye” and “hand”, and CLIP is the client’s voice that says what should be in the picture.
Bottom line: what does it look like inside?
- Base : U-Net is a network that predicts how to remove noise.
- Process : Many steps from noise to picture (or vice versa when learning).
- Helpers : CLIP for text, VAE for compression/decompression.
- Mathematics : Behind the scenes of probability distributions and optimization (but this is for those who like formulas).
When you run the generation, the model starts with noise, uses U-Net to "clean" it up under the guidance of CLIP, and then VAE gives you the finished image. All this happens over tens or hundreds of steps, and each step is a little closer to the result.
Embedding – what is it?
Embedding is a way to turn something complex and difficult for a computer to understand (such as words, sentences, pictures) into a compact set of numbers that the machine can easily understand and use. It is like a “digital fingerprint” or “compressed description” that preserves the essence of the original information.
Imagine you have the word "cat." It’s understandable to humans, but a computer just sees letters. Embedding turns "cat" into a set of numbers, like [0.23, -1.45, 0.89], where each number represents some aspect of the word’s meaning. These numbers aren’t random, but the result of neural network training.
How does this work?
- Training : A neural network (like CLIP or BERT) looks at a huge amount of data (text, images) and learns to find connections between them. For example, it notices that “cat” often appears next to “meow” or “fur”, and this influences the numbers in the embedding.
- Transformation : Once trained, the network can take a word, sentence, or even a picture and output a numeric vector (a list of numbers of a fixed length, such as 512 or 768 numbers) for it.
- Meaning in numbers : These numbers carry information about the meaning. For example, the embeddings of the words "cat" and "kitten" will be similar (close numbers), but "cat" and "table" are completely different.
Example
- The word "cat" → [0.23, -1.45, 0.89]
- The word "dog" → [0.25, -1.30, 0.95] (similar to "cat" because they are both animals)
- The word "car" → [1.50, 0.10, -2.00] (completely different)
If you add or compare these vectors, you can understand how close the words are in meaning.
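Comparing vectors like this is usually done with cosine similarity. A tiny illustration with the made-up 3-number vectors from above (real embeddings have hundreds of dimensions):

```python
# A toy comparison of embedding vectors using cosine similarity.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cat = [0.23, -1.45, 0.89]
dog = [0.25, -1.30, 0.95]
car = [1.50, 0.10, -2.00]

print(cosine_similarity(cat, dog))   # close to 1: similar meaning
print(cosine_similarity(cat, car))   # much lower: unrelated meaning
```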
Where is it used in image generation?
In systems like Stable Diffusion, embeddings are the main format for storing and transmitting data. CLIP turns your text query (the "cat in the hat") into an embedding - a set of numbers that is passed to the diffusion model. It’s like an instruction: "draw something with such-and-such properties." CLIP can also create an embedding for an image to compare it with the text embedding and check how well the image matches the query.
A simple analogy
Embedding is like a translator that takes human language (words, images) and translates it into the “language of numbers” for the computer. Or like coordinates on a map: instead of describing a place in words (“forest next to the river”), you give a point [53.5, 12.3], and everyone understands what you’re talking about.
Why is this important?
- Compactness : Instead of processing an entire text or image, the model works with a small vector.
- Meaning : Embeddings preserve relationships and context (e.g. "cat" is closer to "meow" than to "tractor").
- Versatility : They can be used for text, images, sound, anything.
Now let’s look at how models like Stable Diffusion, CLIP, or VAE are trained, and whether humans are involved to label the data. The answer depends on the type of model and the stage of its creation.
1. Data Labeling: Do You Need People?
Yes, people participate, but not always directly. Most modern models are trained using huge data sets, often collected from the Internet. For example, Stable Diffusion took millions of images with captions from open sources such as websites, social networks, or archives (for example, LAION-5B — a dataset with 5 billion “picture-text” pairs). This data already has “markup” in the form of text descriptions created by people (for example, photo captions on Instagram or alt-text on websites). That is, there may not be any direct manual markup specifically for the model — the data is taken “ready-made”.
Automation. The collection of such data is automated: special programs (scripts, crawlers) go around the Internet, download pictures and their descriptions. People do not sit and sign each picture manually for training - this would be too long and expensive.
2. How does it work for different models?
CLIP (text + pictures):
CLIP learns to link text and images. To do this, it is given “picture-text” pairs (for example, a photo of a cat with the caption “cat in the hat”). These pairs already exist on the Internet, and they do not need to be marked up again — the model simply learns to find patterns between text and image.
People participated indirectly: they once created these captions for their posts or websites. But for the CLIP training itself, the markup is automated.
Diffusion models (Stable Diffusion):
Diffusion models learn to "clean up" noise and create images. To do this, they don’t need labeling in the classical sense ("this is a cat", "this is a dog"). Instead, they take images and add noise to them, and then learn to restore the original. All this happens automatically.
However, textual conditionality (i.e. generation on request like "cat in the hat") is added via CLIP. Here ready-made "picture-text" pairs from the Internet are used again, without manual markup.
VAE (Variational Autoencoder):
VAE learns to compress and reconstruct images. It is given images, and it "figures out" how to encode and decode them. There is no need for marking at all - only the images themselves, which are assembled automatically.
3. When are people really needed?
Data cleaning. Sometimes datasets contain garbage (for example, broken files, irrelevant images, or bad captions). People can manually filter or improve such data, but this is done rarely and only to improve quality. Cleaning is also mostly automated (for example, algorithms remove duplicates or empty images).
Fine-tuning. If a model needs to be adapted to a specific task (for example, to generate only realistic cats), developers can take a small dataset and label it manually. For example, hire people to label 1,000 pictures of cats with details (“ginger cat,” “cat with a bow”). But this is no longer basic training, but an improvement.
Quality control. After the model is created, people check the results and can adjust the learning process by adding new data or changing the approach. But this is not labeling, but rather analysis.
4. Example: Stable Diffusion
- Dataset : LAION-5B, 5 billion images with captions collected from the Internet.
- Markup : The captions were already on the Internet (for example, "cat at sunset" under a photo on a social network). No one marked them up specifically for the model; everything was taken as is.
- Process : CLIP learned to link text and images, and the diffusion model worked with noise. Everything is automated.
5. Bottom line: automation vs. people
Mainly automation. Modern models like Stable Diffusion or CLIP are trained on huge amounts of data that already contain "markup" from people (captions, tags). No one sits and captions billions of pictures specifically for training - the Internet does it all by itself.
Humans are needed indirectly. They create the initial data (uploading photos with captions to the network) or sometimes help with cleaning and tuning. But the basic training is the work of the algorithms.
In simple terms: the models “eat” what’s already on the internet and learn on their own, without constant human intervention. If everything were labeled manually, creating such models would take years and millions of dollars. Automation is the key to their success.
Transformers
Transformers are a type of neural network architecture that were designed to work with sequences of data (like words in a sentence). They “understand” the relationships between pieces of data, even if those pieces are far apart, and they do so quickly and efficiently. They were first introduced in Google’s 2017 paper “Attention is All You Need.”
In simple terms, it’s like a smart "translator" that doesn’t just look at individual words, but understands the entire context of the sentence.
How do they work?
A transformer is built around several key ideas:
Attention mechanism:
This is the "brain" of the transformer. It decides what to pay more attention to in the data. For example, in the sentence "The cat that sleeps on the couch is cute," the transformer understands that "cute" refers to "cat" and not to "couch."
Instead of processing words one by one (like older RNN-type models), the transformer looks at everything at once and figures out which parts are more important.
Encoder and Decoder:
- Encoder : Turns input data (e.g. text) into a set of numbers that contains their meaning (embeddings).
- Decoder : Takes these numbers and turns them into output (such as a translated text or an image).
Some tasks use only the encoder (for example, for text analysis), while others use both.
Parallelism:
Old models (RNN, LSTM) processed data step by step, which was slow. Transformers process everything at once, which speeds up the work.
Layers:
The Transformer is made up of many layers (like a pie), where each layer improves the understanding of the data by adding details and relationships.
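To make the attention idea above concrete, here is a bare-bones sketch of scaled dot-product attention: every "word" looks at every other word and decides how much it matters. Real transformers add learned projections, multiple heads and many layers on top of this.

```python
# A toy sketch of the attention mechanism (no learned weights, single head).
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])                           # relevance of each word to each other word
    weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # softmax
    return weights @ V                                                # mix values by relevance

# 4 "words", each represented by an 8-number embedding (random for the demo).
x = np.random.default_rng(0).standard_normal((4, 8))
print(attention(x, x, x).shape)   # (4, 8): each word updated with context from the others
```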
Example: How a transformer understands text
Let’s say you typed "The cat in the hat walks down the street." The transformer breaks the sentence into words, turns each word into numbers (embeddings), and looks at how the words are related: "in the hat" refers to "the cat," and "walks" is the action. It then gives you a result, such as a translation of the sentence into another language.
Where are transformers used?
Models like BERT (Google) or GPT (OpenAI) are transformers. They translate, write texts, answer questions. Example: ChatGPT is a transformer that “understands” and generates text.
In Vision Transformers (ViT), images are broken into pieces and the transformer analyzes them like words in a sentence. Example: Recognizing objects in photos.
In models like DALL·E or FLUX, transformers help to link text to image. They work together with diffusion models to guide the generation process.
How are transformers related to image generation?
In models like FLUX.1 or DALL·E, the transformer takes your text ("a cat in a hat") and turns it into an embedding - a numerical instruction. This instruction is passed on to the diffusion model, which "draws" the picture based on it. The transformer can also participate in the diffusion itself (as in FLUX.1), improving the process of "cleaning" the noise.
In simple terms, the transformer is the "conductor" that tells the diffusion model what exactly to draw.
Why are they so important?
- Speed : Process data in parallel rather than sequentially.
- Context : Understand relationships over long distances (e.g. in long texts or complex scenes).
- Flexibility : Works with text, pictures, sound, anything.
A short story
- 2017 : Transformers appeared in a Google article.
- 2018 : BERT and GPT showed their power on texts.
- 2020 – 2021 : Vision Transformers (ViT) applied them to pictures.
- 2022 – 2024 : Transformers integrated into generative models (DALL·E, FLUX.1).
A simple analogy
A transformer is like a librarian who instantly finds the book (data) you need, understands what it is about, and can retell it to you or draw a cover. He doesn’t read one page at a time, but "sees" everything at once.
Transformer T5-XXL
T5-XXL (Text-to-Text Transfer Transformer) is a true transformer model developed by Google.
- Architecture : Completely based on transformers with encoder and decoder.
- Feature : Can convert text to text (e.g. translate, rewrite, answer questions). All in the format "text in, text out".
- Size : The XXL version is a huge model with 11 billion parameters, making it one of the most powerful in the T5 family.
- Usage : In image generation, it is used as a text encoder to deeply understand complex queries and turn them into embeddings.
The T5-XXL is a "classic" transformer optimized for word processing.
CLIP + T5-XXL: what is it together?
In systems like FLUX.1 or Stable Diffusion 3 these two models are often combined as text encoders:
- CLIP : Gives a general understanding of the text and the connection with the image. It is good for short descriptions and general meaning.
- T5-XXL : Deepens text analysis, especially for long and complex queries (e.g. "cat in hat walking on the beach at sunset"). It better understands details and context.
How they work together: Text passes through both models, each creating its own embeddings. These embeddings are passed to the diffusion model, which draws the picture. T5-XXL is usually responsible for the precision of the details, and CLIP is responsible for the overall "picture".
In FLUX.1, for example, they are used in tandem: CLIP (usually ViT-L/14) handles the text for general direction, and T5-XXL adds detailed understanding. This is not a single "CLIP T5-XXL" model, but two transformer components working in parallel.
CLIP is good at linking text and images, but weak with long queries. T5-XXL is stronger at handling complex text, but is not related to images on its own. Together, they provide better text understanding for generating accurate and detailed images.
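A hedged sketch of what "two text encoders in parallel" means in code, using the transformers library; the checkpoint names are illustrative (and the XXL variant is enormous, so a smaller T5 would normally stand in for experiments). Real pipelines wire both embedding sequences into the diffusion model for you.

```python
# A sketch, assuming transformers and example checkpoints for both encoders.
from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5Tokenizer

prompt = "a cat in a hat walking on the beach at sunset"

clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
clip_emb = clip_enc(**clip_tok(prompt, return_tensors="pt")).last_hidden_state

t5_tok = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")   # ~11B parameters, very heavy
t5_enc = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")
t5_emb = t5_enc(**t5_tok(prompt, return_tensors="pt")).last_hidden_state

# Two embedding sequences for the same prompt, passed on to the diffusion model together.
print(clip_emb.shape, t5_emb.shape)
```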
Neural Model Formats: Safetensors, GGUF and Others
1. Safetensors
- What is it? Safetensors is a format developed by Hugging Face to safely and quickly store tensors (the numeric arrays that make up models). It was created as an alternative to the old PyTorch .pt/.pth format, which used Python Pickle.
Features:
- Security : Unlike Pickle, Safetensors cannot contain malicious code that could run when a file is downloaded.
- Speed : Very fast loading thanks to "zero-copy" (data is read directly into memory without unnecessary copies).
- Simplicity : Stores only tensors (model weights) and a minimum of metadata, without complex structure.
- Cross-platform : Written in Rust, works not only with Python, but also with other languages.
What is it for? Typically used to store "raw" models (e.g. fp16 or fp32 format) before further processing or quantization. It is a popular choice for Hugging Face models.
Example: The Stable Diffusion XL model is often distributed as safetensors.
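A minimal sketch of the format in action with the safetensors library: the file holds nothing but named tensors and a little metadata, so there is no code inside it to execute.

```python
# A sketch, assuming the safetensors and torch libraries.
import torch
from safetensors.torch import save_file, load_file

weights = {"layer1.weight": torch.randn(4, 4), "layer1.bias": torch.zeros(4)}
save_file(weights, "tiny_model.safetensors")     # write raw tensors to disk

restored = load_file("tiny_model.safetensors")   # fast, safe loading
print(restored["layer1.weight"].shape)           # torch.Size([4, 4])
```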
2. GGUF (GPT-Generated Unified Format)
- What is it? GGUF is a binary format developed by Georgi Gerganov for storing models optimized for inference on regular computers. It is the successor of GGML, but more modern and flexible.
Features:
- Optimization : Designed to load and run quickly on CPU or GPU, especially with quantization (eg 4-bit or 8-bit) to reduce model size and speed up execution.
- Metadata : Stores not only tensors, but also model information (architecture, tokenizer, settings), making it an all-in-one.
- Extensibility : New data (such as custom tokens) can be added without breaking compatibility with older versions.
- Support : Used in tools like llama.cpp to run Large Language Models (LLM) locally.
What is it for? To run models on devices with limited resources (e.g. laptops). It is a popular format for quantized versions of models such as LLaMA or Mixtral.
Example: A file like llama-2-7b-chat.Q4_K_M.gguf is a GGUF with 4-bit quantization.
3. GGML (predecessor of GGUF)
- What is it? GGML is an older format, also by Georgi Gerganov, which was used before GGUF. It started as a C tensor library for model inference, and the file format shared its name.
Features:
- Simplicity : Stored tensors and basic metadata, supported quantization (4-bit, 8-bit).
- Limitations : Not as flexible as GGUF, supported mostly LLaMA architecture, and metadata was less structured.
- Status : Deprecated and completely replaced by GGUF in 2023.
What is it for? It was used to run models locally on the CPU, but it is now a historical format.
4. PyTorch (.pt/.pth)
- What is it? It is a standard format for storing models in PyTorch, based on Python Pickle.
Features:
- Versatility : Can store not only tensors, but also any Python structure (e.g. code, dictionaries).
- Unsafe : Pickle may contain malicious code that will execute when downloaded, making it risky to distribute publicly.
- Size : Usually larger, as it stores data in raw form (fp32 or fp16).
What is it for? For training and fine-tuning models in PyTorch. This is the "source" format, which is then often converted to Safetensors or GGUF.
Example: preparing a model with weights before quantization.
Comparison
Format | Safety | Loading speed | Metadata | Quantization | What is it better for? |
---|---|---|---|---|---|
Safetensors | High | Very high | Minimal | No | Storing "raw" models |
GGUF | Medium | High | Rich | Yes | Local inference |
GGML | Medium | Medium | Basic | Yes | Obsolete, was used for inference |
PyTorch .pt | Low | Medium | Flexible | No | Training and development |
Other formats
- ONNX (Open Neural Network Exchange) : An open format for exchanging models between frameworks (PyTorch, TensorFlow, etc.). Used for portability, but not for fast inference.
- EXL2 (ExLlamaV2) : Format for quantized models, often stored in Safetensors. Faster than GGUF on GPUs, but harder to use.
- AWQ : Another quantization method for GPUs, often paired with Safetensors. Competitor to GGUF in speed.
A simple analogy
- Safetensors : Like a ZIP archive of pictures - safe, opens quickly, but inside there is only data.
- GGUF : Like a finished movie on a disc with subtitles and settings - everything is included and optimized for viewing.
- PyTorch .pt : Like an artist’s working folder - everything is there, but it’s inconvenient to share, and someone might slip in a "virus".
Why choose one or the other?
- If you are a developer and training a model, you will start with pt/pth, then move to Safetensors for storage.
- If you are a user and want to run the model on your PC, choose GGUF to make everything work quickly and easily.
Quantization
- What is it? Quantization is the process of reducing the precision of numbers in a model so that it takes up less space and runs faster. Imagine rounding numbers : instead of 3.14159, you use 3.14.
How does it work?
Model weights (the numbers the network has "learned") are typically stored at high precision (e.g. 32 bits per number - fp32). When quantized, they are converted to lower precision: 16 bits (fp16), 8 bits (int8), or even 4 bits (int4).
It’s like painting a picture with hundreds of colors instead of millions - the quality is slightly worse, but the difference isn’t always noticeable.
Why is it needed?
- Reduces the size of the model (e.g. from 13GB to 4GB).
- Speeds up work, especially on weak devices (CPU, GPU).
- Saves memory.
Example: A model in GGUF with Q4 (4-bit) quantization works on a laptop, but in fp32 it would require a powerful server.
- FP32 (32 bits) "full precision" as 3.1415926535.
- FP16 is approximately 3.14.
- FP8 is even rougher, like 3.1.
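A toy illustration of the "rounding" idea: map fp32 weights to 8-bit integers with a single scale factor, then map them back and look at the error. Real schemes (Q4_K, AWQ and friends) are far more sophisticated, so treat this as a sketch of the principle only.

```python
# A toy sketch of symmetric int8 quantization of a weight array.
import numpy as np

weights = np.random.default_rng(0).standard_normal(1000).astype(np.float32)

scale = np.abs(weights).max() / 127.0                         # fit the value range into int8
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = q.astype(np.float32) * scale                    # back to approximate fp32

print("fp32 size:", weights.nbytes, "bytes; int8 size:", q.nbytes, "bytes")
print("max rounding error:", np.abs(weights - dequantized).max())
```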
Tensors
- What is it? Tensors are multidimensional arrays of numbers that are used to store data in neural networks. They are like tables or cubes of numbers that store information about the model.
In simple words:
- A number (for example, 5) is a 0D tensor (scalar).
- A list of numbers (eg [1, 2, 3]) is a 1D tensor (vector).
- A table (eg [[1, 2], [3, 4]]) is a 2D tensor (matrix).
- "Cube" or more dimensions are tensors 3D, 4D and so on.
Why are they needed?
The model’s weights (what it "learned") are stored in tensors. Images can also be represented as tensors: for example, 256x256x3 (width, height, RGB colors).
Example: In Safetensors, a model is a set of tensors, like huge tables of numbers.
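The same dimensionality ladder, written out as a small sketch with PyTorch tensors:

```python
# 0D, 1D, 2D and 3D tensors side by side.
import torch

scalar = torch.tensor(5)                   # 0D tensor (scalar)
vector = torch.tensor([1, 2, 3])           # 1D tensor (vector)
matrix = torch.tensor([[1, 2], [3, 4]])    # 2D tensor (matrix)
image = torch.zeros(256, 256, 3)           # 3D tensor: height x width x RGB

for t in (scalar, vector, matrix, image):
    print(t.ndim, tuple(t.shape))
```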
Tokens
- What is it? Tokens are the pieces of data that text is broken down into so that the neural network can understand it. They are like "words" to the machine, but not always ordinary words.
How does it work?
A text like "The cat in the hat" is broken down into tokens: ["cat", "in", "hat"] or even smaller pieces (e.g. "ca" and "t"). Each token is assigned a number (ID) from the vocabulary that the model has learned. These numbers are fed into the network as input.
Why are they needed?
To translate human language into numbers that the model works with. In image generation, tokens from text (via CLIP or T5) are turned into embeddings to create an image.
Example: In the query "Cat in the Hat", each token helps the model understand that it should draw a cat and a hat.
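A small sketch of tokenization using the CLIP tokenizer from the transformers library (the checkpoint is an example; other models split text differently):

```python
# A sketch, assuming transformers and an example CLIP checkpoint.
from transformers import CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
ids = tok("a cat in a hat")["input_ids"]

print(ids)                             # the numbers the model actually sees
print(tok.convert_ids_to_tokens(ids))  # the text pieces those numbers stand for
```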
History of Generative Neural Networks
1. Early steps: before the 2010s
Before true generative networks emerged, scientists experimented with models that could “think things up.” These were more mathematical toys than practical tools:
Boltzmann machines (1980s): Created by Geoff Hinton and colleagues, these were simple networks that learned to model data (such as black and white patterns). They didn’t generate images, but they laid the groundwork for future ideas.
What could they do? Almost nothing visual - just abstract patterns. That was the foundation.
2. 2014: GANs (Generative Adversarial Networks) – the revolution begins
Who created it? Ian Goodfellow and his team in 2014.
What were they called? GANs - generative adversarial networks.
How did they work? Two networks competed: one (the generator) created images from noise, the other (the discriminator) checked whether they looked real. They "trained" each other until the generator started producing something plausible.
What could they do? The first GANs generated simple images: blurry faces, numbers (for example, from the MNIST dataset), or rough pictures of objects. The quality was low, but it was a breakthrough — the network itself “invented” the images.
Example: A blurry human face or a crooked cat that looked like a blot.
3. 2015 – 2017: Improving GANs
GANs have evolved rapidly and new versions have emerged:
- DCGAN (Deep Convolutional GAN, 2015) : Created by Alec Radford and colleagues. Used convolutional networks (CNN), which improved the quality. Able to generate sharper images: faces, cars, animals, but still with artifacts.
What could they do? For example, 64x64 pixel faces that could already be recognized as faces, or room interiors.
Problems: Unstable learning - pictures could turn out strange or "broken".
- Progressive GANs (2017) : From Tero Karras and the NVIDIA team. The network learned to generate images gradually, from low resolution (4x4) to high resolution (1024x1024).
What could they do? Realistic human faces in high resolution. This was the first time that generated faces looked almost like real photographs.
4. 2018 – 2019: BigGAN and StyleGAN – Quality is increasing
- BigGAN (2018) : Created by a DeepMind (Google) team (Andrew Brock et al.). Used huge resources and datasets (ImageNet).
What could it do? Generate complex scenes and objects: dogs, landscapes, food. The quality was high, but it required powerful computers.
Example: A realistic dog or flower, although sometimes with minor errors.
- StyleGAN (2019) : Again from Tero Karras and NVIDIA. Added control over the image style (for example, you could change the hairstyle or age of the face).
What could it do? Incredibly realistic faces (up to 1024x1024) that are hard to tell apart from a photo. It also generated cats, cars, even anime characters.
Feature: Now you can “edit” pictures, for example, make a face look older or add glasses.
5. 2020: VAE and first steps towards the text
- VAE (Variational Autoencoders) : Originally proposed in 2013 (Diederik Kingma and Max Welling), but became popular for generation later. They compressed images into a latent space and recreated them.
What could they do? Simple images like handwritten digits or faces, but the quality was worse than GANs. But they were more stable.
Connection with text: They started combining VAE with text, but it was not yet widespread.
6. 2021: DALL·E and CLIP — text meets pictures
- DALL·E (2021) : Created by OpenAI. Combined VAE and transformers (as in language models). Used CLIP to link text and images.
What could it do? Generate images based on text queries: "an armchair in the shape of an avocado" or "a cat in space". The quality was average (256x256), but the creativity was incredible.
CLIP: Also from OpenAI. It did not generate images itself, but learned to understand the connection between text and images, becoming the "brain" for future models.
7. 2022: Diffusion Takes Over
- DDPM (Denoising Diffusion Probabilistic Models, 2020 – 2021) : Introduced by Jascha Sohl-Dickstein and improved by Jonathan Ho, these were the first diffusion models.
What could they do? They generated high-quality images (faces, objects), but slowly - hundreds of steps.
Breakthrough: Showed that diffusion can outperform GANs in detail.
- Stable Diffusion (2022) : Created by Stability AI in collaboration with researchers (Robin Rombach et al.). Combined diffusion, VAE, and CLIP.
What could it do? Generated images from text (512x512 and higher) quickly and efficiently. Worked on regular computers thanks to latent space.
Feature: Open source - anyone can download and use it.
- DALL·E 2 (2022) : OpenAI improved the first version. Used diffusion instead of VAE.
What could it do? Very realistic and creative pictures (up to 1024x1024) for complex queries: "penguin on the beach with a cocktail".
- Imagen (2022) : From Google. Also a diffusion model with an emphasis on quality.
What could it do? Photorealistic images with incredible detail.
8. 2023 – 2025: Modernity
- Stable Diffusion XL (SDXL, 2023) : Improved version from Stability AI. Added Refiner for detailing.
What can it do? 1024x1024 images with high quality and flexibility.
- Midjourney (2022 – 2025) : Closed model, but popular among artists. Specializes in artistic styles.
- DALL·E 3 (2023) : An even more accurate and powerful version from OpenAI.
Evolution: from simple to complex
- Early models (GANs) : Simple faces, low resolution, lots of errors.
- Middle (StyleGAN, BigGAN) : Realism, but no text.
- Text + images (DALL·E, Stable Diffusion) : Connection with requests, creativity.
- Today : High quality, speed, availability.
Who drove progress?
- Scientists : Ian Goodfellow (GANs), Geoff Hinton (basics), Tero Karras (NVIDIA).
- Companies : OpenAI (DALL·E, CLIP), Stability AI (Stable Diffusion), Google (Imagen), NVIDIA (StyleGAN).
What has changed?
- From blurry spots to photorealism.
- From random images to text generation.
- From supercomputers to home PCs.
FLUX
FLUX is a state-of-the-art generative AI model introduced in August 2024 by Black Forest Labs, a company founded by former Stable Diffusion developers (Robin Rombach, Andreas Blattmann, etc.). It is a text-to-image model with a hybrid architecture combining transformers and diffusion techniques, with a scale of 12 billion parameters. Its goal is to generate high-quality images from text descriptions (prompts) with high accuracy and diversity.
- Quality : Produces highly detailed and realistic images comparable to Midjourney or DALL·E 3.
- Accuracy : Better prompt adherence than previous models like Stable Diffusion XL.
- Speed : There is a fast version (Schnell) that works even on regular computers.
- Flexibility : Offers three options:
- FLUX.1 Pro : The most powerful, for commercial use, available via API.
- FLUX.1 Dev : Open for non-commercial use, for experiments.
- FLUX.1 Schnell : Fast and lightweight, Apache 2.0 licensed, for personal use.
How is it trained?
Like Stable Diffusion, FLUX.1 is trained on a huge dataset — likely billions of images with text descriptions from the Internet (e.g. LAION). The exact details of the data are not public, but the training is automated, without manual labeling by humans — the model “absorbs” ready-made text-image pairs.
What can it do?
- Generates photorealistic scenes, people, animals.
- Draws hands and faces well (a problem with older models).
- Handles text on images (e.g. inscriptions, logos).
- Supports different styles from realism to abstraction.
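For completeness, here is a hedged sketch of running the lightweight Schnell variant through diffusers; the checkpoint name, the 4-step setting and guidance_scale=0 follow common recommendations for this variant but are assumptions rather than requirements.

```python
# A sketch, assuming a diffusers version with FLUX support and the Schnell checkpoint.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()   # helps the 12B model fit on consumer GPUs

image = pipe(
    "a cat in a hat holding a sign that says FLUX",
    num_inference_steps=4,   # Schnell is tuned for very few steps
    guidance_scale=0.0,
).images[0]
image.save("flux_cat.png")
```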
FLUX is the next step after GANs (2014), StyleGAN (2019), and early diffusion models (2020). It builds on the ideas behind Stable Diffusion (2022), but with an improved architecture and efficiency. As of today (March 2025), it is one of the leading open source models, successfully competing with paid solutions like Midjourney.