Structures in the fog
20 Jul 2023
AI tools such as the Stable Diffusion model developed by Björn Ommer are teaching computers how to see and paint. A portrait from our magazine EINSICHTEN.
In October 2018, the auction house Christie’s offered a portrait purporting to depict a certain Edmond de Belamy. The picture shows the face of an unknown young man – Christie’s describes him as a “gentleman, possibly … a man of the church” – who might have lived around the middle of the 19th or the start of the 20th century. However, these attributions about the person in the picture with the blurry facial features beneath a high forehead are just suppositions based on art-historical classification of the painting style. You see, the painting is not meant to be an accurate naturalistic representation of a person, but is steeped in the avant-garde atmosphere of the early modernist period. In other words, it is more of an esthetic impression than a likeness – an expression of artistic freedom, you might say, which sold for 432,500 dollars on its auction day in 2018.
But whatever we may say about artistic freedom, we cannot speak of the freedom of the artist here. Because there is no artist.
In the bottom right corner of the picture, where masters usually sign their works, the following signature has been appended to the image: “min_G max_D E_x[log(D(x))] + E_z[log(1 − D(G(z)))]” – a fragment of the computer code that generated the picture. Thus, the creator is a computer, or artificial intelligence (AI) to be precise.
Although in another sense, perhaps it is not the author of the work. Because somebody must have programmed the networked computers that are behind the AI in such a way that they could paint this “beautiful friend” (“bel ami” in French, or “Belamy” as the imaginary subject of the portrait is surnamed). But in this case, nobody did that either! The algorithm responsible for the portrait was produced in tandem by two self-learning neural networks, which compete with each other autonomously. This kind of framework is called a generative adversarial network (GAN). One half of this network, called a generator, was fed data from 15,000 real artistic portraits which were created between the 14th and 20th centuries.
Based on this dataset, the generator produced completely new pictures, which were submitted to a so-called “discriminator” for evaluation. It was the job of the generator to pass the images off as human-made and of the discriminator to call its bluff. Thousands of images were created in this manner, giving rise to a whole family tree, a fictional dynasty of the Belamy clan: a Count and Countess Belamy, a baroness, and an archbishop – a synthetic genealogy that owes its existence to a game of one-upmanship between generator and discriminator. The results have shaken the art world.
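To make that adversarial game concrete, the sketch below shows what such a training loop can look like in PyTorch. It is a minimal, hypothetical illustration – the toy networks, sizes, and data stand in for whatever the Belamy creators actually used; only the logic of generator versus discriminator is the point.

```python
# Minimal sketch of a GAN training step: a generator tries to pass off
# invented images as real, a discriminator tries to call its bluff.
# Network sizes and data here are toy placeholders, not the Belamy setup.
import torch
import torch.nn as nn

latent_dim, image_dim = 64, 28 * 28

generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, image_dim), nn.Tanh(),      # produces fake "paintings"
)
discriminator = nn.Sequential(
    nn.Linear(image_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1),                         # scores real vs. fake
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def training_step(real_images: torch.Tensor):
    batch = real_images.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator update: reward it for spotting the forgeries.
    fake = generator(torch.randn(batch, latent_dim)).detach()
    loss_d = bce(discriminator(real_images), ones) + bce(discriminator(fake), zeros)
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator update: reward it for fooling the discriminator.
    fake = generator(torch.randn(batch, latent_dim))
    loss_g = bce(discriminator(fake), ones)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()

# Example call with random stand-in "portraits":
# training_step(torch.randn(32, image_dim))
```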
For if intelligent computer networks, using deep learning and with just databases of art history to inspire them, are capable of creating autonomous new works, how can we distinguish their productions from those of human artists? Must we recognize the machine works as creative? The Obvious collective, which put the Portrait of Edmond de Belamy up for auction (and collected the half a million dollars – are they the artists?), thus surely had their tongues in their cheeks when they subtitled the portrait of Edmond with the mock quote: “The shadows of the demons of complexity awaken by my family are haunting me.”
Almost five years have passed since then, and now anyone can get synthetic image generators to produce images with simple text commands: DALL-E 2 and Midjourney are just two of the hot names in the field right now. Moreover, the rapid progress of AI is not limited to fine art: schoolchildren and students can now get ChatGPT to write their essays for them.
Nobody is better placed to rein in the supposedly unfettered genies of AI and to calmly enumerate which research problems have already been mastered and which remain unsolved than LMU Professor Björn Ommer. As Chair of AI for Computer Vision and Digital Humanities / the Arts, he also heads the Computer Vision & Learning Group (CompVis), which developed its own image-generation AI – the above-mentioned Stable Diffusion. He researches computer-based image understanding and AI image generation using machine learning methods.
Ommer’s work in the field of computer vision involves empowering computers to interpret visual information and generate completely new images from it based on their “understanding.” In a sense, then, Björn Ommer teaches computers how to see and paint – and does so in such a way that they transform text inputs expressed in natural language into new images. He follows a diffusion approach, whereby networks add Gaussian noise to data and then learn how to remove that noise from the images again. This process can be likened to finding structures in the fog. Over time, it allows them to produce ever more plausible and detailed images.
The images that come out are not copies, but persuasive inventions. “Stable Diffusion,” says Ommer, “is a generative AI system that allows users to simply type into the computer what they would like the picture to be like. And the computer converts the text into an image.” His team has thus developed a system that generates images – and in the future, videos and 3D models as well – from the input of a user who merely describes what should be in the image or video. “Now I have the ability to say to the computer: ‘Please create and edit a picture just as I want it,’” says Ommer. There is a certain irony to the location of Ommer’s office directly opposite the Academy of Fine Arts on Munich’s Akademiestraße. New art works are created on both sides of the street, even if in Ommer’s office they are calculated according to a probability model and not conceived in a burst of inspiration like across the road. Or maybe they are?
Ommer is conscious of the implications of the technology his team has developed. After all, not only have they transformed the menial computing machine into a creative tool, but in plumbing the mysteries of image production they are also delving into the process of our human perception of the world and thus our very understanding of reality. “This approach proves the capabilities of the system, which it shares with humans. Of course, the processes are different than they are in humans. But it’s equally clear that when human capabilities are replicated in computers, then we acquire a better understanding of the structures of human intelligence.”
I’d like to have a machine that does what I say, but that is not self-determining. This is where I see the creativity and artistic nature of humans remaining untouched.
Prof. Björn Ommer
Various image-generating methods using artificial intelligence have developed along different paths. While the GANs à la Belamy pit adversarial neural networks against each other, diffusion models like the one created by Björn Ommer’s team start out by corrupting huge sets of training data. Incorrect information (noise) is added incrementally to this image data until the image is completely unrecognizable. “For us humans, these minimal disruptions are imperceptible at first. But when an image has undergone this process hundreds or thousands of times in succession, then the result is something that looks like the fuzz you saw when you unplugged an old TV set.”
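As a rough, hypothetical sketch of this forward “corruption” process – with an illustrative noise schedule rather than the one Stable Diffusion actually uses – the step-by-step noising can be written in a few lines:

```python
# Forward diffusion sketch: Gaussian noise is mixed into an image step by
# step until only "TV static" remains. Schedule values are illustrative.
import torch

T = 1000                                        # number of noising steps
betas = torch.linspace(1e-4, 0.02, T)           # noise added at each step
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # how much of the image survives

def noisy_image(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Return the image after t noising steps (closed-form shortcut)."""
    eps = torch.randn_like(x0)
    return alphas_bar[t].sqrt() * x0 + (1 - alphas_bar[t]).sqrt() * eps

# At small t the picture is barely disturbed; near t = T - 1 almost nothing
# of the original structure is left.
# Example: noisy_image(torch.rand(3, 64, 64), t=999)
```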
His neural networks are then trained, using probability models, to reverse this process – that is to say, to undo the destruction of the image by gradually removing the noise. Finally, a text encoder translates the user’s prompt into a representation that guides this process, and the decoder of the diffusion model generates the new, synthetic images in a sequence of denoising steps.
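The reverse direction can be sketched just as compactly. The following is a simplified, hypothetical denoising loop: toy_denoiser stands in for the trained network, cond for the encoded text prompt, and the update omits the small amount of fresh noise a full sampler would add at each step.

```python
# Reverse diffusion sketch: start from pure noise ("fog") and remove a little
# predicted noise per step. The real system uses a trained, text-conditioned
# network here; toy_denoiser is only a placeholder so the loop runs.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)

def toy_denoiser(x, t, cond):
    # A trained model would predict the noise contained in x at step t,
    # guided by the text embedding cond.
    return torch.zeros_like(x)

@torch.no_grad()
def generate(denoiser, cond, shape=(1, 4, 64, 64)):
    x = torch.randn(shape)                       # begin in pure fog
    for t in reversed(range(T)):
        eps_hat = denoiser(x, t, cond)           # predicted noise at step t
        # Simplified denoising update (the mean of the reverse step):
        x = (x - betas[t] / (1 - alphas_bar[t]).sqrt() * eps_hat) / alphas[t].sqrt()
    return x                                     # later decoded into pixels

latents = generate(toy_denoiser, cond=None)
```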
What differentiates Ommer’s Stable Diffusion from its well-known competitors (aside from the technical details) is that his team has developed an algorithm that learns a way of representing images which is so compact that it obviates the need for a computer cluster for implementation. It runs on standard consumer hardware and generates images in seconds. To do this, the essence of training data had to be abstracted in such a way that billions of training images fit inside a few gigabytes on people’s home computers. Only an AI that is compact enough to run on the conventional hardware of millions of users would permit the democratization of this technology, says Ommer.
In this spirit, Ommer’s team decided to make their AI available to anyone as open source (https://stablediffusionweb.com/#demo). They provide the source code (https://github.com/CompVis/stable-diffusion) and even ready-made apps for ordinary home computers, making it possible for users to run this AI system, unlike all others, completely independently of corporate constraints and interests.
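In practice, running the released model takes only a few lines. The snippet below is one common way to do it, assuming the openly published CompVis weights and the Hugging Face diffusers package as a wrapper; it is an illustrative usage sketch, not part of Ommer’s own codebase.

```python
# Illustrative usage: load the openly released Stable Diffusion weights and
# generate an image from a text prompt on a single consumer GPU.
# Assumes the torch and diffusers packages are installed.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",   # the openly published model weights
    torch_dtype=torch.float16,         # half precision to fit consumer GPUs
)
pipe = pipe.to("cuda")

prompt = "portrait of a gentleman, possibly a man of the church, oil on canvas"
image = pipe(prompt).images[0]
image.save("belamy_reimagined.png")
```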
Google and Meta (parent company of Facebook, Instagram, and WhatsApp) have not made their image and video generators openly accessible to the public. Although the image generator DALL-E 2 is publicly available, it remains under the control of the Californian company OpenAI, which also developed ChatGPT and is financed by Microsoft among others. Its restrictions and safety filters cannot be inspected – the model and training data are kept just as secret as the code that gets everything going.
Even if commercial considerations undoubtedly play a role here, the companies justify this secrecy in the first instance by pointing to the risk that the image generators could be used for pornography and fake news. “If you push this logic even further,” says Ommer, “it means only a small handful of tech companies would be able to pursue this research in the future.” This is because the systems of the big tech corporations are designed such that they require large server farms not only during training, but to run them as well. “This raises the question,” notes Ommer, “as to where research and application are headed if just a few firms possess the resources as well as the knowledge and the algorithms in order to effectively operate such systems and keep them running.” Moreover, the development of open-source software shows “that the quality of all developments benefits hugely from allowing as many smart minds as possible to conduct research into something and develop their own solutions building on open code.”
For Ommer’s work, therefore, this question is fundamental: “How can we build a powerful, effective tool that is freely accessible and runs on hardware that is affordable for everyone?” He has achieved this with Stable Diffusion. But how did he manage to compress billions of training images – hundreds of terabytes of data, you could almost say the whole internet – such that ordinary home computers can use his image generator?
“Images consist of millions of pixels and any given pixel is not that important. When we observe paintings, we expect shadows or reflections to be done properly. But how each individual hair on a head, each blade of grass on a meadow is oriented is not of any great concern. It matters much more that the hairs are in the right place and that their color and length match. That was our approach: We compress images to retain their essence.” This sets the system apart from others. It extracts this essence so that the diffusion process can subsequently run in a greatly compressed representation. “The local details, which hardly anyone takes notice of, are then simply hallucinated into the picture, stochastically invented as it were. After all, even human artists are more interested in the overall representation of a meadow than in the growth of every single blade of grass,” says Ommer. “This allows us to shed hundreds of terabytes.”
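The effect of this compression is easy to put into numbers. The back-of-the-envelope sketch below uses the factor-8 downsampling and four latent channels of the published Stable Diffusion configuration; the figures are illustrative, not a statement about the exact training setup.

```python
# Why diffusing in a compressed latent space is so much cheaper:
# compare the raw pixel count of an image with its latent representation.
pixels = 512 * 512 * 3                  # a 512x512 RGB image: 786,432 values
latent = (512 // 8) * (512 // 8) * 4    # factor-8 downsampling, 4 channels: 16,384 values

print(f"pixel values : {pixels:,}")
print(f"latent values: {latent:,}")
print(f"reduction    : {pixels / latent:.0f}x fewer values per image")
# The denoising network works on roughly 48x fewer values; the fine,
# "hallucinated" details are filled back in by the decoder afterwards.
```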
And yet weighty, troubling questions remain unanswered. Does artificial creativity deliver the final blow to the narcissistic human presumption of being unique? After the humbling visited on human conceit by the discoveries of Copernicus, Darwin, and Freud, must we now accept that our supposedly unique capability to create works of art was also a chimera? We have already had to cope with not being the center of the universe, but just an agglomeration of cosmic particles; with not being the pinnacle of creation, but just a product of evolution; and with our souls not being masters in their own house, as Freud put it, but being determined by unconscious drives. Is our creativity now destined to be overshadowed by computing power?
Other, less metaphysical questions also arise: Do artists, graphic designers, illustrators, and photographers have to fear for their jobs? Who actually has the copyright to computed artworks: the operators of the networks? Their programmers? The owners of the image databases on which the networks are trained? What is an ‘original’ when imaging techniques can generate pictures, illustrations, and photo-realistic scenes from a few prompts and ‘Belamys’ can be produced by anyone at any time at the push of a button? What will the art world of the future look like if the ultimate fusion of everything with everything else is imminent, the actualization of the anything goes that postmodernism had promised in theory? And are the results actually worth half a million dollars per picture, like the Belamy portrait?
Beuys once said that the human being is the real artist, because humans are the only self-determining beings and thus the sovereigns par excellence.
Björn Ommer
This much we can say with certainty: Computers and networks are not legal persons and therefore cannot be the author of an original work and cannot be granted exclusive rights to “their” work. So when an AI system generates an artwork without human intervention, the question arises as to whether the AI developer owns the copyright to the computed works.
But then would we not also have to consider that the humans who created the billions of original images used by neural networks as training data might be able to assert rights to their own works in this new context? The law is currently unclear on this matter, and jurisdictions in different countries may take different approaches.
Precisely because none of these legal questions can be answered satisfactorily at present, Ommer boils them down to pragmatics, to what can actually be said. “I don’t want to transform the computer into an automatic artist and make artists unemployed or anything like that. The goal is just to make computers a more powerful tool for implementing human creativity.”
In that case, however, have humans just delegated their creativity to machines, letting them do what people are unable to? Ommer replies: “Beuys once said that the human being is the real artist, because humans are the only self-determining beings and thus the sovereigns par excellence. That chimes strongly with my way of thinking. I’d like to have a machine that does what I say, but that is not self-determining. This is where I see the creativity and artistic nature of humans remaining untouched. There is still plenty of space in which we humans can view ourselves as special vis-à-vis the machine. Artificial intelligence will not push us off our throne.”
Text: Bernd Graff
Prof. Dr. Björn Ommer is Chair of AI for Computer Vision and Digital Humanities / the Arts at LMU and heads the Computer Vision & Learning Group (CompVis). Born in 1981, Ommer studied computer science with a minor in physics at the University of Bonn. Having completed a doctorate in computer science at ETH Zurich, he worked as a postdoc at the University of California, Berkeley. From 2009 Ommer was a professor at Heidelberg University, where he was also co-director of the Interdisciplinary Center for Scientific Computing, before joining the faculty of LMU in 2021.
Read more articles from the current edition of "EINSICHTEN. Das Forschungsmagazin" in the online section and browse the issue archive.
Or subscribe to EINSICHTEN free of charge and never miss an issue again (in German).