Making images with CLIP
In early January this year, two posts appeared simultaneously on OpenAI’s blog. One introduced a neural network called CLIP that connects visual concepts with natural language; the other presented another new network called DALL-E, which can produce strikingly realistic images from natural-language instructions. CLIP was used to rank DALL-E’s output images by how well they matched the instructions. OpenAI soon released CLIP to the general public.
Basically, CLIP is a machine learning model that matches text to images. It was trained on 400 million image–text pairs collected from the Internet. It also has a special power called “zero-shot classification”: it can recognize images from categories it never saw during training. This is possible because the captions are not treated as mere passive strings of characters; they are encoded by a transformer model that extracts their semantic meaning. As a result, you can give CLIP an arbitrary description like “a photo of a fishbowl” together with some random image, and it will return a score for how well the image matches the description.
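The scoring mechanism itself is simple to sketch. Below is a minimal numpy illustration of how CLIP turns cosine similarities between an image embedding and several caption embeddings into probabilities; the embeddings here are made-up toy vectors standing in for the output of CLIP’s real image and text encoders.

```python
import numpy as np

def zero_shot_scores(image_emb, text_embs, temperature=100.0):
    """Score one image embedding against several caption embeddings,
    CLIP-style: cosine similarity followed by a softmax."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * (txt @ img)     # scaled cosine similarities
    exp = np.exp(logits - logits.max())    # numerically stable softmax
    return exp / exp.sum()

# Toy embeddings standing in for CLIP's encoders.
image_emb = np.array([1.0, 0.0, 0.2])
captions = np.array([
    [0.9, 0.1, 0.1],   # "a photo of a fishbowl" (close to the image)
    [0.0, 1.0, 0.0],   # "a photo of a dog"      (far from the image)
])
probs = zero_shot_scores(image_emb, captions)
print(probs.argmax())  # → 0, the fishbowl caption wins
```

With the real model, the two toy vectors would come from CLIP’s image and text encoders; everything downstream of them works as shown.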
What OpenAI did with DALL-E was to use CLIP as an image-making tool by pairing it with a generative (image-producing) model. You give the model a text prompt, and it tries to produce an image that fits the prompt while CLIP acts as a “critic” evaluating the model’s performance. This is done in a loop, and the image is updated so as to maximize the CLIP score. You end up with a picture that is an interpretation of the prompt. These interpretations can get pretty interesting.
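The generator-plus-critic loop can be sketched with toy stand-ins: below, a hypothetical `generator` takes the place of a real generative model and a hand-made `clip_score` takes the place of CLIP’s similarity score. The gradient-ascent structure of the loop is the point; none of the functions are the real thing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: in practice the generator would be a network like BigGAN
# and the critic would be CLIP's image/text similarity. Here the
# "generator" is the identity and the "critic" rewards closeness to a
# fixed target vector.
target = np.array([0.2, 0.8, 0.5])

def generator(z):
    return z                              # stand-in for generator(z)

def clip_score(image):
    return -np.sum((image - target) ** 2) # stand-in for CLIP similarity

z = rng.standard_normal(3)                # random latent: the initial "jumble"
lr = 0.1
for _ in range(200):
    image = generator(z)
    grad = -2.0 * (image - target)        # analytic d(score)/dz for this toy setup
    z += lr * grad                        # gradient ascent on the critic's score

print(np.round(generator(z), 2))          # converges toward the target
```

In the real notebooks the gradient flows through both CLIP and the generator via backpropagation; the toy analytic gradient above just keeps the sketch self-contained.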
In just a couple of months, we’ve had an explosion of Google Colab notebooks that improvise on this basic idea. In this post I go through the three best-known of them, which are also the ones I use on a regular basis.
- Big Sleep
Big Sleep uses BigGAN as the generative model. This is the network that can produce very convincing images of non-existent soap bubbles, glasses of wine or plates of spaghetti (all in all there are 1,000 such image categories), as well as surreal visions from the latent space where neurally encoded images live and breathe.
Here is a typical generation sequence from Big Sleep. It starts from a jumble of images – usually dogs – that correspond to the initial random distribution of values in the representational space of the network. In my music videos, I like to use these initial rapid transformations to accentuate points of change in the music. The images eventually converge to a picture that CLIP thinks fits the prompt, which in this case is “a castle in the clouds”.
- Aleph
OpenAI has not published the entire model used in DALL-E, only one part of it: the variational autoencoder (VAE). What this does is take an image, encode it into a compressed representation, and then decode it back. The real powerhouse of DALL-E is the massive language model that uses the VAE in its pipeline to render images that are then evaluated by CLIP – but as said, that part is not available. For now, all we have is the VAE, but we can use its decoder to construct images.
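The encode/compress/decode round trip is easy to illustrate with a drastically simplified stand-in. The toy linear autoencoder below is not DALL-E’s VAE (which is a discrete VAE operating on codebook tokens rather than raw vectors); it just shows the shape of the operation: squeeze the input through a bottleneck, then reconstruct it.

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy linear autoencoder illustrating the encode/compress/decode round
# trip. DALL-E's real component is a discrete VAE; everything here is a
# drastically simplified stand-in.
image = rng.random(64)               # an "image" flattened to 64 values

W = rng.standard_normal((8, 64))     # encoder weights: 64 values -> 8

def encode(x):
    return W @ x                     # compressed representation (8 values)

def decode(z):
    return np.linalg.pinv(W) @ z     # best linear reconstruction: 8 -> 64

recon = decode(encode(image))
print(recon.shape)                   # same shape as the input; detail is lost
```

The bottleneck is what makes the decoder interesting as a generator: feeding it new codes produces new images.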
This is just what Aleph (also known as Aleph2Image) does. It is another project by Ryan Murdock, who also made Big Sleep, and it is very much an experimental work in progress. Images made with Aleph do not always have the same consistency as those made with Big Sleep, but they definitely have their own quirky charm. I especially like the unique concentrations of glowing colors, which have a celestial quality to them.
Below is a generation sequence from an unreleased version of Aleph. Once again, the prompt is “a castle in the clouds”. Notice how the image becomes more polished as the generation advances and the colors get brighter.
If you like Murdock’s work and wish to support him, I encourage you to join his Patreon, where you can also access several of his non-public notebooks.
- Aphantasia
Aphantasia does not depend on a GAN or a VAE; instead it constructs images by decomposing them into sine waves using the Fast Fourier Transform. The waves, or frequencies, are then optimized based on CLIP’s evaluation. The generator comes from Lucent, a set of tools for “feature visualization”, though Lucent plays no further role here.
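The core idea can be sketched in a few lines of numpy: parameterize the picture by a Fourier spectrum instead of raw pixels, and render it with an inverse FFT. This is a minimal illustration of the parameterization only, not Lucent’s actual implementation; the frequency falloff and sizes below are my own choices, and in Aphantasia it is CLIP-guided optimization that updates the spectrum.

```python
import numpy as np

rng = np.random.default_rng(2)
h, w = 64, 64

# The image is parameterized by a complex Fourier spectrum rather than
# raw pixels; an optimizer (guided by CLIP in Aphantasia) would update
# `spectrum`, and the rendered image is its inverse FFT.
spectrum = (rng.standard_normal((h, w // 2 + 1))
            + 1j * rng.standard_normal((h, w // 2 + 1)))

# Damp high frequencies so the canvas starts out smooth rather than noisy.
fy = np.fft.fftfreq(h)[:, None]
fx = np.fft.rfftfreq(w)[None, :]
falloff = 1.0 / np.maximum(np.hypot(fy, fx), 1.0 / max(h, w))

image = np.fft.irfft2(spectrum * falloff, s=(h, w))
print(image.shape)  # (64, 64) grayscale canvas
```

Optimizing in frequency space rather than pixel space is what gives the results their global, canvas-filling quality: changing one frequency coefficient affects the whole image at once.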
Despite the simplicity of the generator, the result is a globally filled canvas with a panoramic, decorative character.
You can also supply a starting image – in the example below I have used a Rubik’s Cube. What’s more, you can save and load the parameters associated with a generated image to provide a starting point for another one. In other words, after generating one image with a prompt, you can use the result as the starting point for generating another image with a different prompt.
This last fact suggests that you could illustrate a longer story with morphing animations, and that is exactly what Aphantasia’s successor – or rather, its companion – Illustra does. With Illustra, you upload a text file and the program renders its contents line by line, blending the transitions seamlessly. Here’s how Illustra handled the beginning of the nursery rhyme “Old King Cole”:
(Also check out Story2Hallucination, which similarly turns longer texts into a series of morphing animations, using Big Sleep under the hood.)
Both the Aphantasia and Illustra notebooks are very easy and convenient to use: fully form-based, with plenty of options to play with. They even output video. Lastly, I want to show you the richness of detail you get when zooming into Aphantasia-produced imagery.
These are by no means the only text-to-image tools out there. For example, I left out Deep Daze (CLIP+SIREN), mainly because I haven’t used it myself. But the ones I’ve introduced here are the big three right now – although the situation keeps changing quickly. Even as I was writing this post, I learned that there is work underway to drive vector graphics with CLIP. If your appetite is whetted, here is a comprehensive list of CLIP-controlled image synthesizers.
Maybe at some later point I’ll make posts investigating each of these in more detail, especially how to animate with them. For now, I hope this has been a somewhat useful overview. Until next time!
(Update 10.3.21 – Big thanks to Vadim Epstein, who reached out to me to clarify some things about Aphantasia. I’ve rewritten that section and also added information about Illustra, his new text-to-image notebook.)