This article is about the technology behind platforms like DALL-E 2 and Midjourney, and why their creators, such as OpenAI, should potentially be paying you money – not charging you …
More and more people on the internet are calling DALL-E 2 and OpenAI a scam. The reason is that DALL-E 2 has suddenly turned into a monetized service, where you need to purchase credits if you use the platform beyond the beta limit.
DALL-E 2 is just one of many new platforms offering you access to AI-generated content and claiming that you can use it for commercial purposes. Other platforms include Midjourney, Jasper Art, NightCafe, Starry AI and Craiyon. We will focus on DALL-E 2 in this blog post, but they are almost identical when it comes to the legal challenges and problems.
'Scam' is a pretty harsh word in our opinion, but there is an obvious problem in using data that other people have created (photos, videos, annotations, the people in the images etc.) and then beginning to sell it back to the same people.
This problem may be overlooked by many of us, because we're simply fascinated by the new technology – which is totally understandable.
However, even though DALL-E 2 at the end of the day is only an advanced pattern recognition machine, its output is not neutral, and the patterns don't come out of thin air.
They are based on tons of data, where there are multiple legal questions to be asked. Questions that are important for you as a potential user of the images that you generate.
Image created by DALL-E 2
AI-models can’t be compared to human beings
You should start by reading this brilliant article in Engadget, before you begin considering using DALL-E 2 images for commercial purposes.
In the Engadget article they point out another very important thing. Namely the fact that DALL-E 2 and OpenAI are NOT relinquishing their own right to commercialize images that users create using DALL-E. Basically meaning that you can generate images that they will then sell commercially to others.
This shows that the intentions are very different from the analogy sometimes used, where DALL-E 2 promoters compare it to a student reading the work of an established author. In this analogy the student learns the author's styles and patterns and later finds them applicable in other contexts and re-uses them there.
However, this is not about a human brain using creative memory to create new creative works. This is about a pattern recognition machine reusing and in some cases reproducing training data in images that are then used or even sold commercially. It’s simply two different worlds – both metaphorically and literally speaking.
Real photo from the real world
JumpStory’s Authenticity Promise
This article is for people who want to understand on a deeper level how this new AI image generation technology works. But before we get started, just a few words on why JumpStory is not currently building a similar machine.
Of course, we have been asked that question multiple times. Not least considering that we’re already using AI in our company, and since we have access to millions of authentic images.
However, this is not a technological discussion for us, but an ethical one. A discussion that has resulted in our Authenticity Promise.
We are fundamentally against a future, where AI-generated images become the norm rather than the exception. Call us old-fashioned, but we believe that the REAL world is beautiful.
We’re proud that our photos & videos portray real human beings in different shapes and sizes. We’re not against the use of AI, but we don’t think it should be used to generate fake people or realities.
Technologies such as synthetic media and DALL-E 2 may be fascinating on the surface, but they pose a real risk too. They risk blurring the lines between real and fake, which is a fundamental threat to the trust between human beings.
This is why JumpStory doesn't use artificial intelligence to generate fake images, but instead uses AI to identify which images are original, authentic and – of course – legal to use for commercial purposes.
These are the images that you find using our service, and we’ve named our approach ‘Authentic Intelligence’.
Understanding how AI images are generated
Enough about JumpStory and the legal issues with DALL-E 2 for now. Let us look at how AI images are generated on platforms like DALL-E 2, Imagen, Craiyon (formerly DALL-E Mini), Midjourney etc., using DALL-E 2 as the most hyped example currently.
To begin with, DALL-E 2 can perform different kinds of tasks, but we will focus on the task of image generation in this blog post.
How it works is that a text prompt is fed into a text encoder, which is trained to map the prompt to a representation space. Afterwards, a so-called prior model maps the text encoding to a corresponding image encoding that captures the semantic information of the prompt.
(If this is already becoming a bit geeky, I’m very sorry, but it will get even worse 😊)
The final step is for an image decoder to generate an image that visualizes the semantic information captured in that image encoding. These are the basics of machines like DALL-E 2.
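To make the three steps concrete, here is a minimal toy sketch in Python. The functions below are stand-ins invented purely for illustration – they bear no resemblance to OpenAI's actual models – but the data flow (prompt → text encoding → image encoding → image) follows the pipeline just described.

```python
import hashlib

def text_encoder(prompt: str, dim: int = 8) -> list:
    """Toy stand-in for the text encoder: maps a prompt to a fixed-size vector."""
    digest = hashlib.sha256(prompt.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]

def prior(text_embedding: list) -> list:
    """Toy stand-in for the prior: maps a text encoding to an image encoding."""
    return [v * 0.5 + 0.25 for v in text_embedding]

def image_decoder(image_embedding: list, rows: int = 4) -> list:
    """Toy stand-in for the decoder: expands an image encoding into pixel rows."""
    return [list(image_embedding) for _ in range(rows)]

def generate(prompt: str) -> list:
    """The full pipeline: prompt -> text encoding -> image encoding -> 'image'."""
    return image_decoder(prior(text_encoder(prompt)))

image = generate("an astronaut riding a horse")
print(len(image), len(image[0]))  # 4 rows of 8 values
```

In the real system each of these stages is a large neural network; the point here is only the order of the hand-offs between them.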
The relationship between text and visuals
DALL-E 2 and similar technologies are often referred to as text-to-image generators. The reason is their ability to receive a text input and deliver an image output.
To give you an example, this is “An astronaut riding a horse in the style of Andy Warhol”:
What happens here is based on OpenAI's model named CLIP. CLIP is short for “Contrastive Language-Image Pre-training” and is a very complex model trained on millions of images and their captions.
What CLIP is especially good at is understanding how much a particular text relates to a particular image. The key here is not the caption, but how related a certain caption is to a certain image.
This kind of technology is named ‘contrastive’, and what CLIP is able to do is to learn semantics from natural language. The way that CLIP has learned this is through a process where the objective is to (now quoting the technical documentation): “simultaneously maximize the cosine similarity between N correct encoded image/caption pairs and minimize the cosine similarity between N² − N incorrect encoded image/caption pairs.”
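That objective is easier to grasp with numbers. Below is a toy illustration in Python: the embeddings are made up (real CLIP embeddings have hundreds of dimensions), but it shows what "maximize the diagonal, minimize the rest" means for an N×N similarity matrix of image/caption pairs.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors: dot product over the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# N = 3 encoded image/caption pairs (toy 3-dimensional embeddings, invented here)
image_embeddings = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1], [0.1, 0.1, 1.0]]
caption_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.2, 0.9]]

n = len(image_embeddings)
similarity = [[cosine_similarity(img, cap) for cap in caption_embeddings]
              for img in image_embeddings]

# Training pushes the N correct (diagonal) similarities up and the
# N^2 - N incorrect (off-diagonal) similarities down.
correct = [similarity[i][i] for i in range(n)]
incorrect = [similarity[i][j] for i in range(n) for j in range(n) if i != j]
print(min(correct) > max(incorrect))  # True for these toy embeddings
```

After training, a caption and its matching image end up close together in the shared representation space, while mismatched pairs end up far apart – which is exactly what the prior model exploits.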
Generating the images
As described above, the CLIP model learns a representation space in which it can determine, how the encodings of images and texts are related.
The next task is to use this space to generate images. For this purpose OpenAI has developed another model named GLIDE, which is able to take the image encoding and – using a diffusion model – perform the image generation.
To briefly explain what a diffusion model is: it's basically a model that learns to generate data by reversing a gradual noising process. This is where it gets very technical, so to quote a description found in the OpenAI documentation:
“The noising process is viewed as a parameterized Markov chain that gradually adds noise to an image to corrupt it, eventually (asymptotically) resulting in pure Gaussian noise. The Diffusion Model learns to navigate backwards along this chain, gradually removing the noise over a series of timesteps to reverse this process.”
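The forward half of that chain – gradually corrupting an image with Gaussian noise – is simple enough to sketch in a few lines of Python. This toy version (values and schedule invented for illustration) only shows the noising direction; the learned reverse model that removes the noise step by step is the hard part, and is omitted here.

```python
import math
import random

random.seed(0)

def forward_noising(signal, timesteps=200, beta=0.05):
    """Forward chain of a diffusion model: at each timestep, shrink the signal
    slightly and mix in fresh Gaussian noise. After enough steps the result
    is close to pure Gaussian noise."""
    x = list(signal)
    for _ in range(timesteps):
        x = [math.sqrt(1 - beta) * v + math.sqrt(beta) * random.gauss(0, 1)
             for v in x]
    return x

clean = [1.0] * 16          # a trivially simple 1-D "image"
noisy = forward_noising(clean)

# The clean signal's coefficient after t steps is (1 - beta)^(t/2),
# so here almost nothing of the original image survives:
print(round((1 - 0.05) ** (200 / 2), 3))  # 0.006
```

A trained diffusion model learns to walk this chain backwards, starting from pure noise and denoising step by step until an image emerges – guided, in GLIDE's case, by the image encoding from CLIP.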
If you want to go even deeper into the technology, we recommend reading this excellent article by Ryan O’Connor.
About the author
Jonathan Løw is one of Denmark’s most well-known entrepreneurs and business authors. He has been nominated as Entrepreneur of the Year and is amongst Denmark’s 100 most promising leaders according to a major Danish business newspaper.
In addition to being a serial entrepreneur, he is the former Head of Marketing at the KaosPilots – named one of the top 10 most innovative business schools in the world by Fast Company. He is also a former startup advisor and investor at Accelerace – the leading investment fund for startups in Denmark.