OpenAI’s latest strange yet fascinating creation is DALL-E, which by way of hasty summary might be called “GPT-3 for images.” It creates illustrations, photos, renders, or images in whatever style you prefer, of anything you can intelligibly describe, from “a cat wearing a bow tie” to “a daikon radish in a tutu walking a dog.” But don’t write stock photography and illustration’s obituaries just yet.
As usual, OpenAI’s description of its invention is quite readable and not overly technical. But it bears a bit of contextualizing.
What researchers created with GPT-3 was an AI that, given a prompt, would attempt to generate a plausible version of what it describes. So if you say “a story about a child who finds a witch in the woods,” it will try to write one — and if you hit the button again, it will write it again, differently. And again, and again, and again.
Some of these attempts will be better than others; indeed, some will be barely coherent while others may be nearly indistinguishable from something written by a human. But it doesn’t output garbage or serious grammatical errors, which makes it suitable for a variety of tasks, as startups and researchers are exploring right now.
DALL-E (a combination of Dalí and WALL-E) takes this concept one step further. Turning text into images has been done for years by AI agents, with varying but steadily increasing success. In this case the agent uses the language understanding and context provided by GPT-3 and its underlying structure to create a plausible image that matches a prompt.
As OpenAI puts it:
“GPT-3 showed that language can be used to instruct a large neural network to perform a variety of text generation tasks. Image GPT showed that the same type of neural network can also be used to generate images with high fidelity. We extend these findings to show that manipulating visual concepts through language is now within reach.”
What they mean is that an image generator of this type can be manipulated naturally, simply by telling it what to do. Sure, you could dig into its guts and find the token that represents color, and decode its pathways so you can activate and change them, the way you might stimulate the neurons of a real brain. But you wouldn’t do that when asking your staff illustrator to make something blue rather than green. You just say, “a blue car” instead of “a green car” and they get it.
So it is with DALL-E, which understands these prompts and rarely fails in any serious way, although it must be said that even when looking at the best of a hundred or a thousand attempts, many images it generates are more than a little… off. Of which later.
In the OpenAI post, the researchers give copious interactive examples of how the system can be told to do minor variations of the same idea, and the result is plausible and often quite good. The truth is these systems can be very fragile, as they admit DALL-E is in some ways, and saying “a green leather purse shaped like a pentagon” may produce what’s expected but “a blue suede purse shaped like a pentagon” might produce nightmare fuel. Why? It’s hard to say, given the black-box nature of these systems.
But DALL-E is remarkably robust to such changes, and reliably produces pretty much whatever you ask for. A torus of guacamole, a sphere of zebra; a large blue block sitting on a small red block; a front view of a happy capybara, an isometric view of a sad capybara; and so on and so forth. You can play with all the examples in OpenAI’s post.
It also exhibited some unintended but useful behaviors, using intuitive logic to understand requests such as making multiple sketches of the same (non-existent) cat, with the original on top and the sketch on the bottom. No special coding here: “We did not anticipate that this capability would emerge, and made no modifications to the neural network or training procedure to encourage it.” This is fine.
Interestingly, another new system from OpenAI, CLIP, was used in conjunction with DALL-E to understand and rank the images in question, though it’s a little more technical and harder to understand. You can read about CLIP here.
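For readers who want a feel for what “rank the images” means in practice, here is a minimal sketch of the general idea: generate many candidates, score each one against the prompt, and keep the best few. This is not OpenAI’s code. The toy two-number vectors stand in for the embeddings a model like CLIP would produce, and the names `cosine_similarity` and `rerank` are hypothetical, chosen only for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Return the cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank(prompt_embedding: Vector,
           candidates: List[Tuple[str, Vector]],
           k: int) -> List[str]:
    """Return the labels of the k candidates most similar to the prompt.

    Assumption: each candidate is a (label, embedding) pair whose embedding
    lives in the same space as the prompt embedding. A real system would get
    these vectors from a trained model; here they are made up.
    """
    ranked = sorted(
        candidates,
        key=lambda item: cosine_similarity(prompt_embedding, item[1]),
        reverse=True,
    )
    return [label for label, _ in ranked[:k]]


if __name__ == "__main__":
    # Made-up stand-ins for a prompt embedding and three generated images.
    prompt = [1.0, 0.0]
    images = [
        ("image_a", [0.9, 0.1]),
        ("image_b", [0.0, 1.0]),
        ("image_c", [0.5, 0.5]),
    ]
    print(rerank(prompt, images, k=2))  # prints ['image_a', 'image_c']
```

The point of the sketch is only the shape of the process: the generator does not need to get it right every time if a second system can sort the hits from the misses.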
The implications of this capability are many and various, so much so that I won’t attempt to go into them here. Even OpenAI punts:
“In the future, we plan to analyze how models like DALL·E relate to societal issues like economic impact on certain work processes and professions, the potential for bias in the model outputs, and the longer term ethical challenges implied by this technology.”
Right now, like GPT-3, this technology is amazing, and yet it is difficult to make clear predictions about where it will lead.
Notably, very little of what it produces seems truly “final.” That is to say, I couldn’t tell it to make a lead image for anything I’ve written lately and expect it to put out something I could use without modification. Even a brief inspection reveals all kinds of AI weirdness (Janelle Shane’s specialty), and while these rough edges will certainly be buffed off in time, the output is far from safe to use as is, just as GPT-3 text can’t simply be sent out unedited in place of human writing.
It helps to generate many and pick the top few, as the following collection shows:
That’s not to detract from OpenAI’s accomplishment here. This is fabulously interesting and powerful work, and like the company’s other projects it will no doubt develop into something even more fabulous and interesting before long.