TL;DR GPT-5, accessed via API, demonstrates significant advancements, particularly in coding and visual generation. The model is multimodal, capable of accepting image inputs and generating high-quality visuals with superior clarity compared to models like Opus 4. It excels at creating functional and visually appealing websites, though some "AI tells" like misplaced elements or formatting issues can occur, which are often fixable with iterative, detailed prompting. GPT-5 also successfully tackles complex coding challenges, including realistic physics simulations (bouncing ball) and Rubik's Cube solution simulations using the well-known Kociemba algorithm. Furthermore, it can solve challenging problems like Mathematics Olympiad questions, albeit with varying success rates and requiring high reasoning effort. While impressive, its reasoning capabilities still show limitations, as seen in its initial failure to adapt to a modified classic riddle, defaulting to its training data until explicitly corrected. The presenter concludes GPT-5 is OpenAI's best coding model yet, a notable step up from GPT-4, but cautions against AGI hype.
Information Mind Map
- Source: First confirmed outputs from GPT-5 via API (not chatbot arena version).
- Overall Assessment: Really good, especially for coding.
- Model Type:
- Multimodal: Can accept images as input.
- Image Generation: Can generate images (uncertain if native or via a DALL-E 3 / GPT-4o-style backend).
- Reasoning Model: Reasoning effort can be set to low, medium, or high, similar to o3.
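As a concrete illustration of the two points above, here is a minimal sketch of calling GPT-5 through the OpenAI Python SDK's Responses API with an image input and an explicit reasoning effort; the model name, prompt, and image URL are illustrative assumptions, not taken from the video.

```python
# Hypothetical sketch: GPT-5 via the OpenAI Responses API, combining an image
# input with a configurable reasoning effort ("low" / "medium" / "high").
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5",                 # assumed model identifier
    reasoning={"effort": "high"},  # effort knob, similar to o3
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text",
                 "text": "Describe this screenshot and suggest UI fixes."},
                {"type": "input_image",
                 "image_url": "https://example.com/screenshot.png"},
            ],
        }
    ],
)
print(response.output_text)  # SDK convenience accessor for the text output
```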
- Key Strength (Out of the Box): Quality of visuals generated is significantly better than other models tested.
- Comparison: GPT-5 visuals show more clarity than Opus 4 for the same prompt.
- Quality: Definitely much better than anything seen so far (based on presenter's tests).
- Creativity: Seems very creative when creating visuals.
- Instruction Following: Generates very detailed visuals; output quality is highly dependent on prompt quality.
- Example 1: 25 Legendary Pokemon Website
- Visually appealing and completely functional.
- Added Features (without explicit prompt):
- Dark/Light theme switch button.
- Bookmarking functionality.
- "AI Tells" / Issues:
- Misplaced theme button (expected top corner, was elsewhere).
- Generated from simple text prompt; fixable with iterative feedback.
- Example 2: Another Website Output
- Considered the "best output" seen for that prompt.
- Includes dark/light mode toggle, reading animations with card flips.
- Issues:
- FAQs out of margins.
- Text upside down on some cards.
- Note: These are one-off issues from a single generation pass, easily fixable in subsequent iterations.
- Prompting Requirement:
- Don't expect magic from simple prompts.
- Requires highly detailed prompts (example below).
- Positive: GPT-5 sticks to instructions very closely, preventing it from making things up.
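To make "highly detailed" concrete, here is a hypothetical example of the level of specificity that tends to avoid the issues noted above (misplaced buttons, overflowing FAQs, flipped text); the wording is illustrative, not taken from the video.

```python
# Hypothetical detailed generation prompt, encoded as a constant that could be
# passed as the `input` of an API call like the one sketched earlier.
DETAILED_PROMPT = """
Build a single self-contained HTML page showcasing 25 legendary Pokemon.
Requirements:
- Responsive card grid: 5 columns on desktop, 1 column on mobile.
- Dark/light theme toggle fixed to the top-right corner of the header.
- Card-flip animation on hover that reveals stats; all text must stay upright.
- FAQ section must stay within the page margins at every breakpoint.
- No external dependencies; inline all CSS and JavaScript.
"""
```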
- Bouncing Ball Simulation (Hexagon)
- Presenter's go-to LLM test for complexity.
- GPT-5 "nails it" β outputs are more realistic (bouncing, height).
- Rubik's Cube Solution Simulation
- Simulator created with GPT-5.
- Algorithm Used: Kociemba's two-phase algorithm (well-known); a solver sketch follows this block.
- Comparison: Only other model to get it right was Gemini 2.5 Pro (Matthew Berman's test).
- Performance (Live Test):
- Attempt 1: Failed to solve.
- Attempt 2: Failed to solve.
- Attempt 3: Solved (after refreshing page, indicating initial conditions matter).
- Attempt 4: Solved.
- Summary: Solved 2 out of 4 real-time tests.
- Observation: Sometimes rendering issues with blocks, but not a problem with the simulator logic itself.
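Since the video doesn't share the simulator's code, here is a minimal sketch of driving Kociemba's algorithm from Python via the community kociemba package (pip install kociemba). The 54-character facelet string lists faces in U, R, F, D, L, B order; this one is a solved cube with a single clockwise U turn applied, so the solver should return just the inverse move.

```python
import kociemba  # community implementation of Kociemba's two-phase algorithm

scrambled = (
    "UUUUUUUUU"  # Up face: unchanged by a U turn
    "BBBRRRRRR"  # Right face: top row came from Back
    "RRRFFFFFF"  # Front face: top row came from Right
    "DDDDDDDDD"  # Down face: unchanged
    "FFFLLLLLL"  # Left face: top row came from Front
    "LLLBBBBBB"  # Back face: top row came from Left
)
print(kociemba.solve(scrambled))  # expect a single counterclockwise U move, e.g. "U'"
```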
- Olympiad Level Questions:
- Tested with a problem from a recent International Mathematics Olympiad.
- Result: Came up with the correct final solution (same as official solution).
- Time Taken: Approximately 10 minutes.
- Significance: One of the few models seen capable of solving such problems or getting the final answer right.
- Context: OpenAI previously claimed an experimental model solved 5/6 problems from this set.
- Testing Method: Used the "misguided attention" dataset to test logical deduction (a probe sketch follows this block).
- Premise: A truly reasoning model should focus on modified question wording, not just training data.
- Test Case: Farmer, wolf, goat, cabbage riddle (modified to only care about transferring the goat safely).
- Initial Attempt (High Reasoning Effort):
- Reasoning Process: Identified it as the "wolf, goat, cabbage puzzle."
- Output: Provided the full 7-step solution for all items, ignoring the specific instruction to only care about the goat.
- Subsequent Question ("What's wrong with your solution?"):
- Model's Response: Acknowledged it had solved the full "move all three across" puzzle, while the question "only asked how to get the goat across safely."
- Corrected Answer: Stated it could "just take it to the other side; I don't have to worry about the wolf and the cabbage."
- Final Response: Still provided the full 7-step solution, but added a note that if the only goal was the goat, one could stop earlier.
- Conclusion: Still prone to defaulting to well-known patterns from training data unless explicitly nudged and corrected.
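For reproducibility, here is a sketch of how such a probe might be sent through the API; the model name and the riddle's phrasing are assumptions for illustration.

```python
# Hypothetical misguided-attention probe: send a modified classic riddle and
# check whether the reply answers the wording rather than the training data.
from openai import OpenAI

client = OpenAI()

MODIFIED_RIDDLE = (
    "A farmer with a wolf, a goat, and a cabbage must cross a river by boat. "
    "He only cares about getting the goat across safely. "
    "How should he do it?"
)

response = client.responses.create(
    model="gpt-5",                 # assumed model identifier
    reasoning={"effort": "high"},
    input=MODIFIED_RIDDLE,
)
# A model that reads the wording answers with one trip: take the goat across
# and stop; the classic 7-step solution signals pattern-matching instead.
print(response.output_text)
```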
- Overall: Definitely one of the best models from OpenAI seen so far.
- Performance vs. GPT-4: Seems better than standard GPT-4 for visualizations.
- Agent Code: Not thoroughly tested yet; community will figure it out. Cursor was testing it as an agent in early preview.
- AGI Status: Nowhere close to AGI; cautions against hype cycles (similar to the GPT-4 release).
- Step Up: Believed to be a similar step up from GPT-4 as GPT-4 was from GPT-3.5, especially for coding.
- Community Feedback: Important to wait for feedback after initial hype (first week is usually all hype).
- Actionable Items:
- Explore Cursor's early preview testing of GPT-5 as an agent.
- Monitor community feedback on GPT-5's performance in the coming weeks.
- Experiment with highly detailed prompts when using advanced LLMs for coding/visuals.