ChatGPT vs Gemini Native Image Generation: Who Does It Better?

Google and OpenAI are competing head-to-head to deliver the best native image generation model. After Google introduced native image generation in Gemini, OpenAI wasted no time and rolled out native image output to all ChatGPT users. To find out which AI model delivers better results, I compared native image generation in OpenAI’s ChatGPT and Google Gemini, testing both for character consistency, text rendering, instruction adherence, and more.

1. Turn Yourself into an Anime Character

I started the comparison by prompting both models to create an anime-style image. As you can see in the results below, ChatGPT (running GPT-4o) hit it out of the park and generated the image in classic Studio Ghibli style in one go. Gemini, on the other hand, couldn’t create an anime-style image at all, even after I tried multiple prompts.

2. Whiteboard Session

In the next test, I asked ChatGPT and Gemini to create an image of a man explaining the concept of relativity on a whiteboard. Thanks to the larger GPT-4o model, ChatGPT produced a great image with legible handwritten text. It even captured the photographer in the reflection.

However, the smaller Gemini 2.0 Flash model struggled to get the text right on the whiteboard. While Gemini successfully added “Beebom” to the man’s t-shirt, it didn’t capture the photographer’s reflection. That said, the man in Gemini’s output looks more authentic than the one in ChatGPT’s output.

3. Design a Menu Card

This test best showcases the difference between ChatGPT and Gemini in native image generation. ChatGPT designed a beautiful menu card with perfect text rendering. It missed the last dish, but otherwise followed my instructions well. Gemini, by contrast, starts to hallucinate when you pack dense information into the prompt. It got nearly all the text wrong, with jumbled words.

4. Create an Infographic

Following that, I asked ChatGPT and Gemini to create an infographic to explain the concept of gravity, featuring Newton as the character. It goes without saying that ChatGPT did a splendid job, both in terms of design and explaining the concept in clear, readable text.

The result is so good that ChatGPT’s native image generation feature can be used to create comic strips, educational books, visual guides, and more.

Gemini, on the other hand, delivered a disappointing result: the text and visuals don’t make any sense. One thing to note is that Gemini 2.0 Flash generates an image within 3 to 4 seconds, while ChatGPT takes more than a minute to produce a single image. ChatGPT relies on the larger GPT-4o model, which uses far more processing power but produces a far more coherent result.

5. Restyle Images

For restyling, I uploaded an image of a cactus plant in a garden and prompted both models to add some colorful flowers. In my testing, ChatGPT went overboard with each refinement and entirely changed the look of the image after each modification. In contrast, Gemini maintained consistency across multiple generations.

While GPT-4o is natively multimodal (built on an auto-regressive architecture), some experts believe that its native image generation feature uses a diffusion-based decoder. That would help it render text accurately, but it would also mean the image is regenerated on each iteration.

If so, it’s not a pure auto-regressive model like Gemini 2.0 Flash, which would explain the difference in image output after each modification.
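One rough way to put a number on this kind of drift is to compare pixel values before and after an edit. The sketch below is only an illustration: it assumes images have already been reduced to same-sized grayscale grids (lists of rows of 0 to 255 values), and the helper name `mean_abs_diff` and the toy pixel values are made up for the example. It is not how either model, or this comparison, measured consistency.

```python
def mean_abs_diff(a, b):
    """Average absolute per-pixel difference between two grayscale grids."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same dimensions")
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    count = sum(len(row) for row in a)
    return total / count


# Toy 2x2 "images" (hypothetical values, for illustration only).
original = [[10, 10], [200, 200]]
local_edit = [[10, 10], [200, 255]]    # only one pixel changed
regenerated = [[40, 90], [120, 30]]    # every pixel changed

print(mean_abs_diff(original, local_edit))   # 13.75
print(mean_abs_diff(original, regenerated))  # 90.0
```

A small score means the edit touched only part of the image, while a large score suggests the whole image was redrawn. Real comparisons would typically use a perceptual metric rather than raw pixel differences.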

6. Blend Images Together

Next, I uploaded two images and asked both ChatGPT and Gemini to create an image of the woman holding the mug. Both models delivered impressive results. In fact, Gemini was a bit more creative and changed the posture as well. That said, OpenAI says GPT-4o can handle up to 20 images in one prompt and leverages in-context learning to create a single, unified image.

7. Change the Point of View

In the next test, I uploaded an image of a hallway and prompted ChatGPT and Gemini to change the point of view. Both models delivered similar results, but ChatGPT stayed closer to the original image. Gemini hallucinated and added an extra leg to the armchair. Overall, I will give this round to ChatGPT since it mirrored the opposite view more accurately.

8. A Wall Clock Showing 6:30

In the last test, both ChatGPT and Gemini failed to render the specified time (6:30) on the wall clock. It’s a recurring issue in AI image generation: models tend to default to 10:10 because of biases in the training data. So, even with native image generation, OpenAI and Google have not been able to overcome this limitation in instruction following.
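To make the failure concrete, here is a minimal sketch of where the hands should sit for a given time, measured in degrees clockwise from the 12 o’clock position. The function name `clock_hand_angles` is purely illustrative and has nothing to do with either model’s API.

```python
def clock_hand_angles(hour: int, minute: int) -> tuple[float, float]:
    """Return (hour_hand, minute_hand) angles in degrees clockwise from 12."""
    minute_angle = minute * 6.0                     # 360 degrees / 60 minutes
    hour_angle = (hour % 12) * 30.0 + minute * 0.5  # 30 degrees per hour, plus drift
    return hour_angle, minute_angle


print(clock_hand_angles(6, 30))   # (195.0, 180.0)
print(clock_hand_angles(10, 10))  # (305.0, 60.0)
```

At 6:30, both hands point almost straight down, with the hour hand halfway between 6 and 7. At 10:10, the hands form the familiar upward “V”, which looks nothing like the requested time.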

Conclusion: ChatGPT vs Gemini Native Image Generation

After running a range of tests, I can confidently say that ChatGPT’s native image generation is currently more advanced than Gemini 2.0 Flash. It’s powered by the larger GPT-4o model, which has broader world knowledge and produces more coherent images. It renders text accurately and follows instructions with impressive precision.

In contrast, Google’s experimental Gemini 2.0 Flash model is smaller, which results in faster performance. However, it often hallucinates while rendering dense text, and the results are of lower quality.

What makes Gemini stand out is that it maintains consistency after each generation, which is a big advantage. We should wait for native image output support on the newly released Gemini 2.5 Pro model, which is expected to deliver exceptional performance in native image generation.

Arjun Sha

Passionate about Windows, ChromeOS, Android, security and privacy issues. Has a penchant for solving everyday computing problems.


