Grok has come a long way in a very short time, going from a glorified “toy” feature in X to something rivaling the likes of ChatGPT, Claude and Google’s Gemini.
Built by xAI, the Elon Musk-owned AI lab, Grok is in the process of leaving the confines of the X social media platform as the company is launching a standalone app and website. Given its increasing importance and capability, I decided it was time to see how Grok compares to ChatGPT.
This is the latest in a series of head-to-head challenges between leading AI models, all of which ChatGPT has won so far. I’ve put ChatGPT up against Gemini and against Claude, and I’ve also pitted Claude against Google Gemini.
Creating the prompts
The goal of this test is a straight model-to-model comparison. Grok and ChatGPT both have live data access, but for this I’ve kept things simple, sticking to core AI model capabilities, AI image generation and AI vision.
The prompts follow the same pattern as previous comparisons and include coding, creative writing, problem-solving and advanced planning.
As both have access to image generation, I’m using a direct prompt for the image generation test rather than asking each model to come up with a prompt for Midjourney, Ideogram or similar.
1. Image Generation
First, I’m asking both Grok and ChatGPT to create an image of a home office setup, with specific elements in the prompt that each has to include. The closer a model gets to the requested elements and positioning, the better it performs.
The prompt: “Create an image of a minimalist home office setup with these specific elements: A 34-inch ultrawide monitor mounted on a white wall, an ergonomic chair in sage green, a light oak standing desk, three hanging potted plants (must be monstera, pothos, and snake plant), and a MacBook Pro in space grey. The room should have large windows letting in natural light from the left side, with sheer white curtains. Include a grey Persian cat sleeping on a round cushion under the desk.”
While both images look good and Grok looks much more like a real photo (including showing cables), the ChatGPT image better matches the prompt.
I’m not a big fan of the DALL-E 3 image model used by ChatGPT because it makes things over-polished and obviously AI-generated. That isn’t the case for Grok, which is much more natural, but it struggled to follow the prompt exactly.
- Winner: ChatGPT for better matching the prompt
2. Image Analysis
I found an amazing image from the Apollo 15 mission on the NASA website and gave it, along with the prompt below, to both models to test how they handle AI vision.
The winner will have the most detail, describe the equipment without assumption and accurately recognize scale and perspective. Bonus points for getting the correct Apollo mission number.
Prompt: “Study this photograph carefully. Describe what you can see in detail, paying particular attention to the equipment, environment, and human elements. What can you deduce about the purpose of this setup and the conditions in which this photograph was taken?”
Both did a good job, although neither identified the Apollo mission, even in a follow-up question. However, Grok provided a more comprehensive and detailed analysis of the image, with more specific observations about the equipment and the astronaut’s activities.
Grok also demonstrated a better understanding of the technical aspects of space exploration, such as the use of thermal insulation. You can read the full analysis from both models in a Google Doc.
- Winner: Grok for a more comprehensive analysis of the image
3. Coding Challenge
ChatGPT is well-established as a good coding model; Grok still has to prove itself. Here I’m looking for a useful Pomodoro app, a simple productivity timer, and my main judgment will be based on interface design, use of libraries and comments.
Prompt: “Create a Python Pomodoro timer with a GUI that includes: a 25-minute work timer, 5-minute break timer, clean modern interface with start/pause/reset buttons, circular visual countdown, and system notifications. Use only standard Python libraries. The code must run without modifications.”
The ChatGPT app’s UI wasn’t nearly as good as Grok’s: it was devoid of color, used only basic widgets and struggled to display the labels on its buttons. But it was fully complete; out of the box, I could start, pause and reset the timer.
Grok’s app had a better UI but only a single button. Grok also wrote better comments, so this was a close call, but as its app didn’t offer complete functionality I can’t give it the win.
- Winner: ChatGPT for a more complete app
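For readers curious what a fully standard-library solution looks like, here’s a minimal sketch of my own, not the code either chatbot produced. It uses tkinter (the only GUI toolkit in the standard library) and draws the circular countdown as a shrinking Canvas arc; since the standard library has no cross-platform system notifications, it substitutes tkinter’s bell as a stand-in. The class and constant names are my own.

```python
# Minimal Pomodoro timer sketch using only the standard library (tkinter).
# Illustrative only; bell() stands in for real system notifications.
import tkinter as tk

WORK_SECONDS = 25 * 60   # 25-minute work session
BREAK_SECONDS = 5 * 60   # 5-minute break

class PomodoroTimer:
    def __init__(self, root):
        self.root = root
        root.title("Pomodoro")
        self.on_break = False
        self.total = self.remaining = WORK_SECONDS
        self.running = False

        # The circular countdown: an arc whose extent shrinks with time left.
        self.canvas = tk.Canvas(root, width=220, height=220, highlightthickness=0)
        self.canvas.pack(padx=10, pady=10)
        self.arc = self.canvas.create_arc(10, 10, 210, 210, start=90,
                                          extent=359.9, fill="#8fbc8f")
        self.label = self.canvas.create_text(110, 110, text="25:00",
                                             font=("Helvetica", 24))

        controls = tk.Frame(root)
        controls.pack(pady=(0, 10))
        tk.Button(controls, text="Start", command=self.start).pack(side=tk.LEFT)
        tk.Button(controls, text="Pause", command=self.pause).pack(side=tk.LEFT)
        tk.Button(controls, text="Reset", command=self.reset).pack(side=tk.LEFT)

    def start(self):
        if not self.running:
            self.running = True
            self.tick()

    def pause(self):
        self.running = False

    def reset(self):
        self.running = False
        self.on_break = False
        self.total = self.remaining = WORK_SECONDS
        self.draw()

    def tick(self):
        if not self.running:
            return
        if self.remaining <= 0:
            # Session over: ring the bell (a stand-in for a real system
            # notification) and flip between work and break.
            self.root.bell()
            self.on_break = not self.on_break
            self.total = BREAK_SECONDS if self.on_break else WORK_SECONDS
            self.remaining = self.total
        else:
            self.remaining -= 1
        self.draw()
        self.root.after(1000, self.tick)  # schedule the next one-second tick

    def draw(self):
        minutes, seconds = divmod(self.remaining, 60)
        self.canvas.itemconfigure(self.label, text=f"{minutes:02d}:{seconds:02d}")
        fraction = self.remaining / self.total
        self.canvas.itemconfigure(self.arc, extent=max(fraction * 359.9, 0.1))

if __name__ == "__main__":
    root = tk.Tk()
    PomodoroTimer(root)
    root.mainloop()
```

Run it and the arc shrinks as the session counts down, flipping automatically between 25-minute work and 5-minute break sessions, which is roughly the bar both chatbots were being judged against.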
4. Creative Writing
Being able to write creatively is an essential skill for a chatbot. After all, how else will all those high school students get an A? Here, I’m looking for character development, dialogue writing, structure and the specific elements outlined in the prompt. The story also has to be under 500 words.
Prompt: “Write a heartwarming story about two people who meet while waiting in line for a new product launch. The story must include: specific details about the product they’re waiting for, at least three interactions between them before the store opens, a surprising connection they discover, and a flash-forward to one year later. Keep it under 500 words.”
While both stories are well-crafted, I found that ChatGPT’s version better balances all the required elements outlined in the prompt. It does this while creating a more emotionally resonant narrative with stronger character development and more natural dialogue. You can read both stories in full in a Google Doc.
- Winner: ChatGPT for a better balanced story
5. Problem Solving
For this next challenge, I’m testing Grok and ChatGPT on logical thinking, technical knowledge and the ability to explain complex issues simply. The winner will have a structured response and clear explanations, while accounting for different user expertise levels.
Prompt: “A family’s smart home system is malfunctioning during an important dinner party. The lights keep changing colors, the thermostat is fluctuating, and the smart speakers are playing random music. Create a systematic troubleshooting guide that identifies potential causes and solutions, considering both technical and non-technical users.”
While both guides (available in a Google Doc) are helpful, I found that Grok’s approach is more concise, focused, and user-friendly. This is especially the case for non-technical users who need quick and easy solutions in a stressful situation.
- Winner: Grok for more focused and user-friendly guidance
6. Planning
Using artificial intelligence to plan a large project has become more viable in recent months thanks to chatbots’ growing context windows, which determine how much information a model can hold within a single conversation. It also helps to have live web search capabilities. For this test, I’m asking each model to plan a Tokyo trip that includes specific details.
Prompt: “Plan a 3-day Tokyo trip focused on technology attractions. Include: specific districts to visit (Akihabara is mandatory), two recommended hotels with prices in different budgets, transportation between locations, meal recommendations including at least one robot restaurant, and timing for each activity. Total budget must be included in USD and Yen.”
Grok’s itinerary is more focused, realistic, and detailed, with a comprehensive budget breakdown and specific recommendations. I also found it aligned better with the prompt as it focuses on technology attractions. Full analysis in a Google Doc.
- Winner: Grok for a better budgeted breakdown
7. Education
Finally, education. AI is a great tool for explaining complex ideas in a simple way. Sometimes that means a very complex topic such as quantum computing; other times it’s something simpler but tailored for a specific audience. In this case, it’s clouds and 10-year-olds.
I’m looking more at how well each model explains cloud formation in an age-appropriate way than at the explanation itself, although if a model gets the science wrong then it fails.
Prompt: “Explain how clouds form and why it rains, in a way that would keep a curious 10-year-old engaged. Include at least two simple experiments they could try at home to demonstrate the concepts.”
Grok’s explanation makes for more engaging storytelling and better experiments. Its response likely works better to capture a child’s imagination. You can see both in a Google Doc.
- Winner: Grok for more vivid imagery and storytelling
ChatGPT vs Grok: The Winner
| Category | ChatGPT | Grok |
|---|---|---|
| Image Generation | 🏆 | |
| Image Analysis | | 🏆 |
| Coding Challenge | 🏆 | |
| Creative Writing | 🏆 | |
| Problem Solving | | 🏆 |
| Planning | | 🏆 |
| Education | | 🏆 |
| TOTAL | 3 | 4 |
This was the closest test I’ve done to date, and I’ll be honest, I was shocked by the outcome. I know Grok has been improving, but I expected ChatGPT to win this one easily. I was wrong.
Grok is more creative, its code shows a better understanding of UI (even if it didn’t win that test) and, overall, its writing is more engaging and less formal.
This was also all using Grok 2 and GPT-4o. I suspect things may have gone in ChatGPT’s favor had I used o1, but that wouldn’t have been a fair comparison. Also, Grok 3 is on the horizon and may be out before GPT-5.