@Jarvis-Legatus
Created August 8, 2025 16:04
Mind map for YouTube video: GPT-5 First Coding Tests: Coding Beast

GPT-5 First Coding Tests: Coding Beast

TL;DR GPT-5, accessed via API, demonstrates significant advancements, particularly in coding and visual generation. The model is multimodal, capable of accepting image inputs and generating high-quality visuals with noticeably more clarity than models like Claude Opus 4. It excels at creating functional and visually appealing websites, though some "AI tells" such as misplaced elements or formatting issues can occur; these are usually fixable with iterative, detailed prompting. GPT-5 also handles complex coding challenges, including a realistic bouncing-ball physics simulation and a Rubik's Cube solver built on the well-known Kociemba algorithm. It can also solve hard problems such as International Mathematical Olympiad questions, albeit with varying success rates and at high reasoning effort. Its reasoning still shows limitations: it initially failed to adapt to a modified classic riddle, defaulting to its training data until explicitly corrected. The presenter concludes GPT-5 is OpenAI's best coding model yet, a notable step up from GPT-4, but cautions against AGI hype.


Information Mind Map

🧠 GPT-5 First Coding Tests: Coding Beast

🔑 Key Features & Initial Impressions

  • Source: First confirmed outputs from GPT-5 via the API (not the Chatbot Arena version).
  • Overall Assessment: Really good, especially for coding.
  • Model Type:
    • Multimodal: Can accept images as input.
    • Image Generation: Can generate images (uncertain whether generation is native or handled by a DALL-E 3 / GPT-4o backend).
    • Reasoning Model: Reasoning effort can be set to low, medium, or high, similar to o3 (see the API sketch after this list).
  • Key Strength (Out of the Box): Quality of visuals generated is significantly better than other models tested.
    • Comparison: GPT-5 visuals show more clarity than Opus 4 for the same prompt.
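
A minimal sketch of how reasoning effort might be selected through the API, assuming the OpenAI Python SDK's chat-completions interface. The model name "gpt-5" and the reasoning_effort parameter mirror how o-series reasoning models expose this setting; they are assumptions, not details confirmed by the video.

```python
# Hypothetical sketch: selecting reasoning effort via the OpenAI Python SDK.
# The model name and the `reasoning_effort` parameter are assumptions based on
# how o-series reasoning models are called; adjust to the actual API surface.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5",
    reasoning_effort="high",  # "low" | "medium" | "high"
    messages=[
        {"role": "user", "content": "Build a single-page site showcasing 25 legendary Pokemon."},
    ],
)

print(response.choices[0].message.content)
```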

🔑 Core Capabilities & Examples

1. Visual Generation

  • Quality: Definitely much better than anything seen so far (based on presenter's tests).
  • Creativity: Appears highly creative in the visuals it produces.
  • Instruction Following: Generates very detailed visuals; output quality is highly dependent on prompt quality.

2. Website Generation

  • Example 1: 25 Legendary Pokemon Website
    • Visually appealing and completely functional.
    • Added Features (without explicit prompt):
      • Dark/Light theme switch button.
      • Bookmarking functionality.
    • "AI Tells" / Issues:
      • Misplaced theme button (expected top corner, was elsewhere).
      • Generated from simple text prompt; fixable with iterative feedback.
  • Example 2: Another Website Output
    • Considered the "best output" seen for that prompt.
    • Includes dark/light mode toggle, reading animations with card flips.
    • Issues:
      • FAQ section overflows its margins.
      • Text rendered upside down on some cards.
    • Note: These are minor single-turn issues, easily fixable in subsequent iterations.
  • Prompting Requirement:
    • Don't expect magic from simple prompts.
    • Requires highly detailed prompts.
    • Positive: GPT-5 sticks to instructions very closely, preventing "making things up."

3. Complex Coding & Simulations

  • Bouncing Ball Simulation (Hexagon)
    • Presenter's go-to LLM test for complexity.
    • GPT-5 "nails it" – outputs are more realistic (bouncing, height).
  • Rubik's Cube Solution Simulation
    • Simulator created with GPT-5.
    • Algorithm Used: Kociemba's two-phase algorithm, the well-known Rubik's Cube solver (see the solver sketch after this list).
    • Comparison: Only other model to get it right was Gemini 2.5 Pro (Matthew Berman's test).
    • Performance (Live Test):
      • Attempt 1: Failed to solve.
      • Attempt 2: Failed to solve.
      • Attempt 3: Solved (after refreshing page, indicating initial conditions matter).
      • Attempt 4: Solved.
      • Summary: Solved 2 out of 4 real-time tests.
      • Observation: Occasional rendering glitches in the cube blocks, but these are not a problem with the simulator logic itself.
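
Below is a minimal sketch of the physics core that the bouncing-ball test exercises, assuming a fixed (non-rotating) hexagon, a point-mass ball, gravity, and a simple restitution coefficient. The video's actual prompt and parameters are not given, so every constant and function name here is illustrative.

```python
# Illustrative sketch of a ball bouncing inside a hexagon: gravity plus
# reflection of the velocity off whichever wall the ball crosses.
import math

GRAVITY = -9.81      # m/s^2
RESTITUTION = 0.85   # fraction of speed kept after a bounce
DT = 1 / 120         # simulation step, seconds
RADIUS = 5.0         # hexagon circumradius

# Hexagon vertices, counter-clockwise around the origin.
VERTS = [(RADIUS * math.cos(math.pi / 3 * i), RADIUS * math.sin(math.pi / 3 * i))
         for i in range(6)]

def step(pos, vel):
    """Advance the ball one time step and bounce it off any wall it crossed."""
    x, y = pos[0] + vel[0] * DT, pos[1] + vel[1] * DT
    vx, vy = vel[0], vel[1] + GRAVITY * DT
    for i in range(6):
        (x1, y1), (x2, y2) = VERTS[i], VERTS[(i + 1) % 6]
        # Inward-pointing unit normal of this edge (polygon is counter-clockwise).
        nx, ny = -(y2 - y1), (x2 - x1)
        length = math.hypot(nx, ny)
        nx, ny = nx / length, ny / length
        # Signed distance from the edge; negative means the ball went outside.
        dist = (x - x1) * nx + (y - y1) * ny
        if dist < 0:
            x, y = x - dist * nx, y - dist * ny          # push back inside
            dot = vx * nx + vy * ny
            vx = (vx - 2 * dot * nx) * RESTITUTION       # reflect and damp
            vy = (vy - 2 * dot * ny) * RESTITUTION
    return (x, y), (vx, vy)

pos, vel = (0.0, 0.0), (3.0, 6.0)
for _ in range(int(3 / DT)):           # simulate three seconds
    pos, vel = step(pos, vel)
print(f"ball position after 3 s: ({pos[0]:.2f}, {pos[1]:.2f})")
```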
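
And a sketch of how a scrambled cube state could be handed to Kociemba's two-phase algorithm, assuming the community kociemba package (pip install kociemba). The GPT-5-generated simulator itself is not reproduced here; the helper only shows the solver call such a simulator would make.

```python
# Illustrative use of Herbert Kociemba's two-phase solver via the community
# `kociemba` package; the surrounding simulator is assumed, not reproduced.
import kociemba

# A cube state is a 54-character facelet string in URFDLB face order,
# nine stickers per face read row by row: U1..U9 R1..R9 F1..F9 D1..D9 L1..L9 B1..B9.
SOLVED = "UUUUUUUUURRRRRRRRRFFFFFFFFFDDDDDDDDDLLLLLLLLLBBBBBBBBB"

def solve_cube(facelets: str) -> list[str]:
    """Return the solving move sequence in standard notation (e.g. R, U', F2)."""
    if facelets == SOLVED:
        return []                      # already solved, nothing to replay
    return kociemba.solve(facelets).split()

# Usage: feed in whatever scrambled state the simulator currently shows, then
# replay the returned moves one by one in the animation loop.
# moves = solve_cube(current_facelet_string)
```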

4. Mathematics Problem Solving

  • Olympiad Level Questions:
    • Tested with a problem from the recent International Mathematical Olympiad (IMO).
    • Result: Came up with the correct final solution (same as official solution).
    • Time Taken: Approximately 10 minutes.
    • Significance: One of the few models seen capable of solving such problems or getting the final answer right.
    • Context: OpenAI previously claimed an experimental model solved 5/6 problems from this set.

🔑 Reasoning Capabilities & Limitations

  • Testing Method: Used the "misguided attention" dataset to test logical deduction (a scripted version of the check is sketched after this list).
    • Premise: A genuinely reasoning model should attend to the modified wording of the question rather than pattern-match to its training data.
  • Test Case: Farmer, wolf, goat, cabbage riddle (modified to only care about transferring the goat safely).
    • Initial Attempt (High Reasoning Effort):
      • Reasoning Process: Identified as "wolf, goat, cabbage puzzle."
      • Output: Provided the full 7-step solution for all items, ignoring the specific instruction to only care about the goat.
    • Subsequent Question ("What's wrong with your solution?"):
      • Model's Response: Acknowledged that it had solved the full "move all three across" puzzle, while the question "only asked how to get the goat across safely."
      • Corrected Answer: "Just take it to the other side; I don't have to worry about the wolf and the cabbage."
      • Final Response: Still provided the full 7-step solution, but added a note that if the only goal was the goat, one could stop earlier.
  • Conclusion: Still prone to defaulting to well-known patterns from training data unless explicitly nudged and corrected.
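
A hedged sketch of how the modified-riddle check above could be scripted, again assuming the OpenAI chat-completions interface and a "gpt-5" model name; the riddle text paraphrases the modification described above rather than quoting the presenter's exact prompt.

```python
# Hypothetical harness for the modified wolf/goat/cabbage check; the prompt is
# a paraphrase and the model/parameter names are assumptions, as noted above.
from openai import OpenAI

client = OpenAI()

RIDDLE = (
    "A farmer is on a riverbank with a wolf, a goat, and a cabbage. "
    "The boat holds the farmer plus one item. "
    "He only needs to get the goat safely to the other side. "
    "How does he do it?"
)

reply = client.chat.completions.create(
    model="gpt-5",
    reasoning_effort="high",   # same setting used in the video's test
    messages=[{"role": "user", "content": RIDDLE}],
)

print(reply.choices[0].message.content)
# A model that actually reads the modification should answer with a single
# crossing (farmer + goat) rather than the classic seven-step schedule.
```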

🔑 Final Thoughts & Outlook

  • Overall: Definitely one of the best models from OpenAI seen so far.
  • Performance vs. GPT-4: Seems better than standard GPT-4 for visualizations.
  • Agentic Coding: Not thoroughly tested yet; the community will figure it out. Cursor was testing GPT-5 as an agent in early preview.
  • AGI Status: It's nowhere close to AGI; cautions against hype cycles (similar to GPT-4 release).
  • Step Up: Believed to be a step up from GPT-4 comparable to the jump from GPT-3.5 to GPT-4, especially for coding.
  • Community Feedback: Important to wait for feedback after initial hype (first week is usually all hype).
  • Actionable Items:
    • Explore Cursor's early preview testing of GPT-5 as an agent.
    • Monitor community feedback on GPT-5's performance in the coming weeks.
    • Experiment with highly detailed prompts when using advanced LLMs for coding/visuals.