I recently created a new way to generate AI art that does not directly use or copy artists’ work to generate images. It is an exploration of how to visually enable large language models (LLMs).
Click this link to try it out and see what you can draw and get a sense of what the app is like.
How Can an LLM Know About the Visual World?
I was interested in how ChatGPT is able to understand the visual world despite being an AI that is trained only on text and words. It does not use any images, so how does it know what things look like?
How can an AI that has never seen an image, had no images in its training set, and cannot output an image know what the visual world looks like?
I spent a few days puzzling over this and came up with a solution that I think is pretty cool and offers a nice proof that LLMs can become visually enabled.
DrawGPT – An Exploration in Visually Enabled LLMs
After thinking about how to get an LLM to render images, I decided that instead of just a proof of concept I would create an entire application that would showcase exactly how this could be done.
You can see it here at this link DrawGPT.
How Can an LLM Become Visually Enabled to Generate Pictures and Images?
The first step in creating a visually enabled LLM is of course the training data.
In my experience with ChatGPT, I found it highly likely that OpenAI had in fact used CLIP or CLIP-like data in the training data for GPT-3. It would be very difficult for a large language model to understand visual objects, their colors, the relative visual composition of objects, and everything else based on purely textual information alone.
While I cannot definitively prove this is true, it seems likely given OpenAI’s products like DALL-E.
There is certainly a lot of visual information in large language model training sets that use only text. Paintings like the Mona Lisa are discussed in depth in art reviews, basic anatomical structures of things like animals are discussed in biology textbooks, things like buildings and skylines and landscapes are written about endlessly in literature. But I do not believe that would be enough to enable an LLM to become visually enabled in a way that would consistently output correct visual imagery.
CLIP (an AI model trained on pairs of images and captions to match images with text, which tools built on it can use to create a text description of an image) is a tool that can take visual text descriptions to the next level. By breaking down a visual image into distinct text tokens, CLIP and CLIP-like data create a direct set of tokens related to visual imagery.
We know CLIP data works very well for generating images with AI because things like Stable Diffusion, Midjourney, and DALL-E all use CLIP or CLIP-like data to generate images. This pointed me toward a direction for DrawGPT.
Text Tokens, Pixel Data, and Diffusion, Oh My!
Most of the AI art tools we see right now (Jan 2023) are based on a combination of CLIP data to create text tokens and latent pixel diffusion. This is what allows “text to image” AI art.
In order to be able to create “any” image, these pixel diffusers need to be trained on copious amounts of images. The subject matter of each image is extracted either from metadata provided in the training set or by running the image through CLIP and using the output alongside the image.
What is going on behind the scenes with text inputs to pixel diffusion is that the text tokens are encoded and used to guide the diffusion process. The model breaks down the text phrase you sent as an input, starts from random noise, and refines that noise step by step toward an image that matches the text tokens. The more refinement steps it can take, the better the output image tends to be.
This is a phenomenal way to create AI art and it is very effective. But it also has some major issues.
The major problem with things like DALL-E and Stable Diffusion is that the image sets they were trained on did not necessarily credit the artists properly. Things like the artist’s style, the subject matter, the image composition, and many more things were extracted during training using CLIP or available metadata.
And we’re not talking about a few images here. We’re talking about millions of images scraped from the Internet, possibly from sources that did not even know they were being scraped. Yes, technically the terms of service were not broken during the collection of the images for the training set, but the resulting backlash suggests that the image collection was in an ethical gray zone.
As we’ve seen online there are many artists who are not happy with the way their work is being used in these AI art tools.
This is a major issue, and it is something that I thought I could also uniquely address with DrawGPT by using ONLY an LLM, with no actual pixel data. An LLM cannot copy anything about an artist’s work directly because it is not sampling or reading the pixel data of the images, only the text descriptions of them from CLIP data.
DrawGPT – Part of the Solution to Potential Art Theft & Ethical Dubiousness
One way to easily get around the issue of artists feeling that their work is being copied is simply to not copy it.
That seems simple enough on the surface, but in practice it has not really been realistic. With the introduction of genuinely large LLMs like GPT-3, GPT-3 DaVinci, ChatGPT, Bloom, and others, the total corpus of textual works in the training set, including any CLIP data, should be sufficient to give an LLM enough visual references to create images simply from words.
The problem is that LLMs are not trained to create images. They are trained to create text. And while they can be trained to create images, they are not trained to create images in a way that is visually coherent.
And that raises the question of how a visually enabled LLM is able to express itself. While it may know what a dog is, it may not know what a dog looks like. And even if it knows what a dog is and knows what a dog looks like from written examples, how would it draw one given that it cannot output pixel data?
How Can An AI LLM Draw?
This was my first question. Because the field of AI research with these LLMs, transformers, and diffusers is so new, it wasn’t really something AI researchers were looking at. I did not have a lot of work to reference, as no one had really been considering how to get the LLM itself to draw.
Much like the need for a truly massive training set, the LLMs themselves needed to reach a certain maturity before this was realistic to explore as research.
Even if an LLM has enough visual reference data, it also needs a sufficiently large corpus of training data on an output medium so that it can output tokens correctly enough for images to be rendered.
With the introduction of GPT-3 and the checkpoint GPT-3 DaVinci, we have reached a point where the AI can in fact command a visual medium with enough complexity to correctly render images.
What is the medium for an LLM? Well, seeing how it can only use text it needs the text that it outputs to create an image. Since the images are digital this means the LLM needs to output instructions to draw a digital image.
This leaves only a few options for visual, artistic mediums for an LLM:
- SVG – An XML-based plaintext format for web-enabled vector images.
- HTML – Using the HTML5 canvas tag with JavaScript draw commands. It’s well supported in all browsers now.
- LaTeX – A typesetting system for documents and complex equations which can draw lines but is not very suited for visual work.
- ASCII – Using text characters to create a visual image by using each character as a “pixel”.
Of these options the only realistic choices are SVG and HTML5 canvas. LaTeX is not really suited for visual work and ASCII is not really suited for actual drawing (it’s great for CLI output or things like comments in web3 smart contracts).
SVGGPT ??? Nope.
SVG turned out to be a little too complex and verbose. It’s a very powerful format but the additional characters it uses with the XML spec + all of the attributes ended up being very difficult to create an image with.
While SVG does work, and it was the first format I tried because it seemed ideal, there were some major issues. Notably, limits on output tokens often resulted in partial SVG drawings, and with open tags left unclosed it just wasn’t possible to consistently generate complete images even on a basic level.
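The truncation problem can be sketched in a few lines of JavaScript. This is a minimal illustration and not code from DrawGPT: the `unclosedTags` helper and both sample strings are hypothetical, and the regular expression only handles simple markup (no comments, CDATA, or `>` inside attribute values).

```javascript
// Hypothetical helper: returns the names of tags that are still open
// when the string ends. Simplified on purpose (see the caveats above).
function unclosedTags(svg) {
  const stack = [];
  const tagPattern = /<(\/?)([A-Za-z][\w:-]*)[^>]*?(\/?)>/g;
  let match;
  while ((match = tagPattern.exec(svg)) !== null) {
    const [, closing, name, selfClosing] = match;
    if (selfClosing) continue; // <rect ... /> opens nothing
    if (closing) {
      // Pop only when the closing tag matches the innermost open tag
      if (stack[stack.length - 1] === name) stack.pop();
    } else {
      stack.push(name);
    }
  }
  return stack;
}

// Made-up samples: the same drawing, complete and cut off mid-tag
const complete =
  '<svg xmlns="http://www.w3.org/2000/svg"><g fill="green">' +
  '<rect width="512" height="100"/></g></svg>';
const truncated =
  '<svg xmlns="http://www.w3.org/2000/svg"><g fill="green"><rect width="5';

console.log(unclosedTags(complete));  // empty array: every tag is closed
console.log(unclosedTags(truncated)); // 'svg' and 'g' are still open
```

An output cut off by a token limit looks like the second sample: the `svg` and `g` elements are never closed, so there is no complete image to show.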
HTML5 Canvas GPT ??? Yep.
It turned out that using the 2D context of an HTML5 canvas tag with draw commands in JavaScript was the perfect way to draw basic images with an LLM.
Using a very complex prompt that limits the output to only the relevant code, I was able to consistently get DrawGPT to output code that would draw images. You are able to see the JavaScript draw commands on DrawGPT when you create an image. Give it a try! All the JavaScript code for any image is currently open source on the website.
The 2D canvas context draw commands in JavaScript are the standard draw commands you see in most low-level visual systems: things like fill, rect, lineTo, arc, etc. They are not really meant for drawing complex, detailed images, but they are perfect for drawing basic images.
This is why most of the output of DrawGPT is not detailed imagery like you expect from Stable Diffusion, DALL-E or any of the latent pixel diffusion methods used by other AI art models.
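To make the draw commands concrete, here is a hand-written sketch of the style of code involved. It is illustrative only and is not actual DrawGPT output: the scene, coordinates, colors, and the `drawSunset` and `createRecordingContext` names are all assumptions. The recording context is a hypothetical stand-in for a real 2D context so the sketch can run outside a browser.

```javascript
// Illustrative only: basic 2D canvas commands with a comment per object.
function drawSunset(ctx) {
  // Sky
  ctx.fillStyle = '#ff9966';
  ctx.fillRect(0, 0, 512, 512);

  // Sun
  ctx.fillStyle = '#ffdd55';
  ctx.beginPath();
  ctx.arc(256, 300, 60, 0, Math.PI * 2);
  ctx.fill();

  // Mountain
  ctx.fillStyle = '#554466';
  ctx.beginPath();
  ctx.moveTo(0, 400);
  ctx.lineTo(180, 220);
  ctx.lineTo(360, 400);
  ctx.closePath();
  ctx.fill();

  // Ground
  ctx.fillStyle = '#335533';
  ctx.fillRect(0, 400, 512, 112);
}

// Hypothetical stand-in for a 2D context: it records each method call
// instead of painting, so the function above can be run anywhere.
function createRecordingContext() {
  const calls = [];
  return new Proxy({ calls }, {
    get(target, prop) {
      if (prop in target) return target[prop];
      return (...args) => { calls.push([prop, ...args]); };
    },
  });
}

const ctx = createRecordingContext();
drawSunset(ctx);
console.log(ctx.calls.length); // 11 recorded draw calls
```

In a browser you would pass a real context instead, for example `drawSunset(document.querySelector('canvas').getContext('2d'))`.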
While it would be possible to draw more detailed images using an LLM + JavaScript draw commands, given the output token limit of the GPT-3 API calls it is just not feasible for this particular proof of concept.
To note: if the prompt is changed to ask for more detailed images, or more detailed pixel art, then the AI LLM models will attempt to draw more detailed images. But the output will be limited by the output token limit of the GPT-3 API calls.
How Can We Know An LLM Is Drawing Things Correctly?
Once I was able to get the LLM to consistently render images, the question became, “Is it drawing things correctly?” There was some difficulty at first with more complex scenes or complex objects, as it wasn’t clear exactly what the AI was drawing. Are those dots in the sky birds, or are they just noise and artifacts like traditional pixel diffusion methods often produce?
It’s easy to see when DALL-E or Stable Diffusion create an image and the tokens are correctly represented but sometimes it’s not so obvious with a simplified image.
One massive advantage of using an LLM for drawing is that you can simply have it tell you what each object is supposed to be. This isn’t really an option with most of the other AI art methods, as they are not trained to output text alongside the image describing each feature or token in the output image. You can always run the output image through CLIP, but that does not give insight into the actual drawing process or specifically what each object should be.
By forcing the output to include relevant code comments in the JavaScript (you can see them in the code on the page), I was able to get the LLM to reveal the various objects it was attempting to draw.
I was surprised.
Not only was the LLM (default OpenAI GPT-3 DaVinci) now creating images, but I was also able to verify that the things it was drawing were correct.
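Reading those comments back out of the code is simple. As a minimal sketch, assuming the generated code puts each label on its own `//` comment line (the `listDrawnObjects` helper and the sample code are hypothetical, not DrawGPT's implementation):

```javascript
// Hypothetical helper: collect the text of whole-line // comments,
// which in this setup name the objects the code is trying to draw.
function listDrawnObjects(code) {
  return code
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.startsWith('//'))
    .map((line) => line.slice(2).trim());
}

// Made-up sample in the style of commented draw code
const generated = [
  '// Head',
  'ctx.arc(256, 200, 80, 0, Math.PI * 2);',
  '// Left eye',
  'ctx.arc(226, 180, 8, 0, Math.PI * 2);',
  '// Right eye',
  'ctx.arc(286, 180, 8, 0, Math.PI * 2);',
  '// Mouth',
  'ctx.arc(256, 220, 30, 0, Math.PI);',
].join('\n');

console.log(listDrawnObjects(generated));
// lists: Head, Left eye, Right eye, Mouth
```

Comparing that list with the rendered image is a quick way to check whether each shape is what it was meant to be.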
DrawGPT Draws Really Well, It Knows What It Is Drawing
It was stunning to see the AI generated images coming out consistently & correctly.
What do I mean by that? For example:
- Portraits – Things like hair, eyes, nose, ears, mouth are all in the correct places. It draws those things “inside” a circle it will draw for a head and they will be correctly ordered vertically (the eyes are never below the mouth)
- Landscapes – Mountains, sunsets, birds in the sky, clouds, trees, etc. are all in the correct place. It never tries to put the ground above the sky or have mountains strangely floating in space.
- Objects – It knows the basic layout of common but complex objects like bicycles, lamps, and many other things. While it cannot draw a fully perfect bicycle, the image it renders features the basic elements in the correct places.
- Animals – It understands the basic layout of animals, including the number of legs, relevant things like ears or fins and attempts to place them correctly. A great sample is the image used for the DrawGPT AI Art Twitter Bot image. You can clearly see it was trying to draw a bird.
Regardless of whether this used CLIP data, the reality is that the LLM is drawing things correctly.
It is not just drawing random things in random places on the image. It does have some issues with relative scaling, but it is hardly ever so bad that the image itself is not recognizable.
It is also drawing things in the correct order. It will draw the ground before the sky, the sky before the clouds, the clouds before the sun, the sun before the mountains, the mountains before the trees, the trees before the birds, etc.
In addition to drawing concrete objects it is also able to draw things like abstract shapes and patterns. It is not perfect but it is able to draw things like circles, squares, triangles, and other basic shapes. It is also able to draw things like stripes, polka dots, and other patterns.
It will use loops, if statements, and other basic programming constructs to draw things like a grid of squares, a pattern of circles, birds in the sky, and fruit on trees.
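As an illustration of that kind of construct, here is a hand-written sketch (not DrawGPT output) that uses nested loops and an if statement to draw a grid of squares. The `drawCheckerboard` name, the colors, and the small stub object are all assumptions. The stub stands in for a real 2D context so the sketch runs outside a browser.

```javascript
// Illustrative only: a loop plus a conditional produces a whole pattern
// from a handful of lines, which matters when output tokens are limited.
function drawCheckerboard(ctx, cells = 8, size = 64) {
  for (let row = 0; row < cells; row++) {
    for (let col = 0; col < cells; col++) {
      // Alternate colors so neighboring squares differ
      if ((row + col) % 2 === 0) {
        ctx.fillStyle = '#222222';
      } else {
        ctx.fillStyle = '#eeeeee';
      }
      ctx.fillRect(col * size, row * size, size, size);
    }
  }
}

// Hypothetical stub with only the two members used above
const squares = [];
const stub = {
  fillStyle: '',
  fillRect(x, y, w, h) {
    squares.push({ x, y, w, h, color: this.fillStyle });
  },
};

drawCheckerboard(stub);
console.log(squares.length); // 64 squares covering 512×512
```

The same dozen lines would draw 64 squares or 6,400, so loops let the model describe repeated elements without spending tokens on each one.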
Sometimes the LLM chooses to express itself with text as well. It is able to use the text commands to label things or make statements within the image itself.
One truly surprising thing was when I sent in no subject to draw at all. The AI will just draw something totally random: portraits, fine art, landscapes, and of course its all-time favorite, the Mona Lisa.
It loves to draw the Mona Lisa.
DrawGPT Is Not Perfect
If you use the app you’ll see that yes, the images are very simplistic. It is sometimes difficult to tell what you are looking at because the image is just a series of boxes and circles.
Portraits will occasionally be unrecognizable, as it will pick similar colors for some things and make the image a mess. I believe that issue could likely be solved very easily with a better model or more specific training data designed to allow better visual responses.
The LLM is not perfect but it is drawing things correctly. If you reference the comments in the code it becomes clear that the concepts and tokens in the image are correct even if it is limited by the simplicity of the medium it has to use.
This is mostly a tradeoff of drawing images with simple, text-only draw commands, and is rarely an issue with the actual output tokens of the AI.
DrawGPT – Adding Some Character + An Impish Twitter Bot
For fun I have the prompt adjust the comments in the code to add a little flavor to the output, often including a humorous take on the prompt or subject matter.
This was important because it gives the images, the output, and the entire AI a feeling of being a character that you are interacting with. This is similar to the way people feel they are speaking conversationally with ChatGPT, and it is incredibly important for interacting with AI.
Seeing as how DrawGPT was able to draw things correctly & provide a little flavor, character, and humor I decided to create a Twitter bot that would allow users to reply to a tweet and have DrawGPT reply with an image. This also allowed me to experiment with incredibly complex input prompts that I would have otherwise not thought of on my own.
If you’d like to use the DrawGPT Twitter bot you can reply to any tweet with “@DrawGPT draw” and it will respond with an image of the tweet you are replying to and include a link to the image on the website so you can see the code & comments as well as share the link.
DrawGPT – A New Way To Create AI Art
DrawGPT will likely never be a commercial hit. The art is too simplistic to appeal to most people and the output tokens are too limited to be useful for most image generation tasks.
At the same time the simplicity of the images, combined with the LLM drawing important features of the subject, often creates a sort of “caricature” of the subject. For example if you have it draw Trump it will almost always try to draw some sort of hair.
It’s a really fun thing & the creativity of the AI LLM and how it draws is pretty mind blowing. It’s also a great way to get a glimpse into how the AI is thinking.
DrawGPT – The Code & The Images & The Prompt & License
DrawGPT currently uses the stock OpenAI GPT-3 DaVinci model. There is no additional fine-tuning, and no additional training sets have been added.
At this time I will not be releasing the prompt I am using.
I do list on the website the prompt tokens & the output tokens as returned so users and researchers can get a feeling for what the prompt may be like.
All of the code and images on the website generated by DrawGPT are currently under the CC0 license. This may change some day, but the intent is to provide an open source & fun project that publicly showcases the concepts for users and AI researchers.
What Is Next For AI Art and DrawGPT?
The front facing portion of every AI that interacts with humans is a language model.
As humans we express ourselves through language. Regardless of whether the AI is an LLM or something like Stable Diffusion, Disco, DALL-E, VQGAN, POINT-E, or any other AI, we as humans still have to instruct it with language.
At this time I do not have any huge plans for DrawGPT. I may attempt to introduce other LLMs as a sort of litmus test for how visually enabled they are and I will certainly be giving it a spin with GPT-4 when it comes out.
I chose to output the image at 512×512 pixels, the size expected for most img2img inputs on other models. That way the outputs can be used as inputs to more complex AI art models, so it is fully compatible with things like Stable Diffusion.
I am extremely pleased with the way DrawGPT turned out.
I think that I have conceptually proved a few things and hopefully other AI researchers in the future can build with some of the fundamentals & tips & tricks I explored:
- Visually enable LLMs by including CLIP data in the language training set.
- Make sure the LLM also has sufficient training on the output medium.
- Use the visual output to correctly identify if the AI and large language model “understands” complex visual concepts.
- Include code comments or metadata of tokens in the output linked to specific parts of the image to identify if the drawing is “correct”.
- Give the AI character and flavor to make it fun to interact with.
- Enable the use of crowdsourced or social inputs to explore complex inputs you would not normally think of yourself.
Did You Write This With AI?
No. The horrendous spelling mistakes and terrible grammar are my own. I’m a programmer, not an English teacher.
Did You Really Not Click the Link Yet?
If you have somehow made it this far into the article without clicking, now is the time.
Click here to try out DrawGPT and draw your own images with AI and generate art with an AI that only knows written words and has never seen a pixel in its life.