GPT-5 is a freak

TL;DR GPT-5 is presented as OpenAI's latest and most capable "AI system," excelling particularly in STEM subjects like coding, math, science, and research. The review showcases its impressive ability to generate complex, interactive HTML applications from single prompts, including simulations (beehive, fluid dynamics, ray tracing), games (3D racing), and practical tools (CRM dashboard, Photoshop clone, video editor, meditation guide). It also demonstrates strong research and information synthesis capabilities, especially in health-related queries, and a significantly reduced hallucination rate compared to previous models. While it shows minor flaws in some generated UIs and image consistency, GPT-5 consistently ranks #1 across various independent benchmarks for coding, creative writing, and overall performance, offering competitive pricing. The video concludes by highlighting the accelerating pace of AI development, with new state-of-the-art models emerging every few weeks.

Information Mind Map

🧠 GPT-5: A Comprehensive Review

🔑 Overview & Core Strengths

Introduction: Full unscripted review covering capabilities, limitations, specs, performance, and comparisons.
Primary Focus: Excels across STEM subjects (coding, math, science, research).
General Tasks: Capable of simple tasks like writing essays or replying to emails, but other models perform these well too.
Sponsor: Thanks to HubSpot.

🔑 Capabilities & Practical Demonstrations

💡 Coding & Interactive Applications (HTML focus)

Beehive Construction Simulation:
- Prompt: "Make a visual simulation of a beehive construction... Include sliders for colony size and resource availability. Put everything in a standalone HTML file."
- Outcome: Successfully generated interactive simulation with expanding hive, pollen gathering, honey storage, and functional sliders (colony size, resource availability).
- Features: Pause/Reset buttons worked.
- Impression: Very impressive, zero-shot generation.
3D Racing Game:
- Prompt: "Make a 3D racing game through neon lit cyberpunk tracks with speed boosts and collision physics. Put everything in a standalone HTML file."
- Outcome: Functional 3D racing game with neon lit cyberpunk track.
- Features: Yellow blocks slow down, pink blocks give speed boost.
- Response Style: Short and concise, a "no BS" model similar to Grok 4.
- Impression: Less error-prone for simple app coding than other top models.
Fluid Dynamics Visualizer:
- Prompt: "Create an animation of fluid dynamics, include interactivity and sliders with multiple color dyes. Put everything in an HTML file."
- Outcome: Interactive visualizer with multiple color dyes (red, blue, yellow, green).
- Features: Sliders for diffusion, viscosity, time step, dissipation, vorticity, dye amount. Randomize colors and clear functions. Resolution selection.
- Impression: Simulates fluid dynamics and color mixing effectively, zero-shot.
Real-time Ray Tracing Simulation:
- Prompt: "Develop a real-time ray tracing simulation featuring a metallic sphere suspended above a street scene. Use any 3D street view environment and allow adjustable parameters such as reflectivity, roughness, and other material properties of the sphere."
- Outcome: Generated a metallic sphere reflecting environment.
- Self-Correction: Detected and fixed its own error (fix bug button).
- Features: Adjustable roughness, metalness, exposure, sphere height. Reflectivity slider did not visibly work. Clear coat and clear coat rough had subtle, unclear effects.
- Impression: Good at ray tracing and creating physically correct objects.
CRM Dashboard:
- Prompt: "Create a beautiful CRM dashboard that offers real-time insights into sales, customer engagement, and marketing campaigns. Include interactive graphs and charts, etc., etc. Put everything in one HTML file."
- Outcome: Fully interactive dashboard with total sales, conversion rate, new leads, pie chart, revenue trend (with 7-day moving average), campaign performance, engagement heat map.
- Flaws: Sales funnel did not look perfect. Values were made up.
- Features: Time range selection, live toggle, widget selection for display, draggable widgets.
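The 7-day moving average on the revenue trend chart can be sketched as a trailing window mean. This is a minimal Python illustration, not the code GPT-5 generated (which ran inside the HTML file); the function name and the sample revenue figures are made up.

```python
from collections import deque


def trailing_moving_average(values, window=7):
    """Return the trailing mean for each point; early points use the data seen so far."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)  # holds at most `window` latest values
    running_sum = 0.0
    averages = []
    for value in values:
        if len(recent) == window:
            running_sum -= recent[0]  # this value is evicted by the append below
        recent.append(value)
        running_sum += value
        averages.append(running_sum / len(recent))
    return averages


# Made-up daily revenue figures, purely illustrative.
daily_revenue = [100, 120, 90, 110, 130, 95, 105, 140]
smoothed = trailing_moving_average(daily_revenue)
```

Keeping a running sum avoids re-adding the whole window at every point, which matters little for a week of data but keeps the sketch linear for longer time ranges.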
Photoshop Clone:
- Prompt: "Create a clone of Photoshop with all the basic tools."
- Outcome: Functional Photoshop-like application.
- Features: Brush, eraser, layers (add, deselect, move), fill buckets, shapes (line, rectangle, ellipse), text tool, crop, select (drawing within selection), pan, zoom, move.
- Image Editing: Brightness, saturation filters, fit to screen, background color (BG) change, opacity, blend modes (multiply, screen, overlay).
- Impression: Extremely powerful for creating basic apps, none of the other top models could do this zero-shot.
Video Effects Editor:
- Prompt: "Create a page where I can upload a video and apply different advanced effects to it in real time. Show the uploaded video and the final video side by side."
- Outcome: Video editor with real-time effects.
- Effects: None, grayscale, sepia, invert, brightness and contrast, saturation, hue, sharpen, gaussian blur, edge detect, RGB split, pixelate, posterize.
- Flaws: Vignette effect was incorrect (applied to center instead of edges).
- Impression: All settings except vignette worked, coded zero-shot.
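The vignette flaw noted above comes down to the direction of the falloff: a vignette should darken pixels more the farther they are from the center. The following is a minimal Python sketch of that idea on a grayscale grid. It is not the generated editor's code, and the function names and the quadratic falloff are assumptions chosen for illustration.

```python
import math


def vignette_factor(x, y, width, height, strength=0.5):
    """Brightness multiplier: 1.0 at the center, 1 - strength at the corners."""
    cx, cy = (width - 1) / 2, (height - 1) / 2
    max_dist = math.hypot(cx, cy)  # distance from the center to a corner
    if max_dist == 0:
        return 1.0  # 1x1 image: nothing to darken
    dist = math.hypot(x - cx, y - cy) / max_dist  # 0 at center, 1 at corners
    return 1.0 - strength * dist ** 2


def apply_vignette(pixels, strength=0.5):
    """Apply the vignette to a grayscale image given as rows of 0-255 ints."""
    height, width = len(pixels), len(pixels[0])
    return [
        [round(value * vignette_factor(x, y, width, height, strength))
         for x, value in enumerate(row)]
        for y, row in enumerate(pixels)
    ]


# A flat gray 5x5 image: the center stays bright, the corners get darkest.
flat = [[200] * 5 for _ in range(5)]
shaded = apply_vignette(flat)
```

Inverting the factor (using `strength * dist ** 2` alone as the multiplier, for example) would darken the center instead, which matches the bug described in the review.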
Mindfulness Meditation Guide:
- Prompt: "Make a single interactive page for a mindfulness meditation guide. Generating calm fractal patterns that evolve with breathing exercises with sounds. Include timers and progress trackers."
- Outcome: Interactive meditation guide.
- Features: Adjustable length, goal breaths, theme (forest, dusk), fractal style (branches, fern) animating with breathing phases (inhale, hold, exhale). Different breathing patterns (4-7-8, calm), customizable cycles.
- Audio: Background noise options (swell, binaural) for guided breathing. Binaural sounds were 3D.
- Flaws: White on white text issue.
- Impression: No need for paid meditation apps, zero-shot generation.
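The timer behind a guided breathing pattern can be sketched as a simple phase schedule. This assumes the pattern listed above is the common 4-7-8 technique (inhale 4 s, hold 7 s, exhale 8 s); the function name and tuple layout are hypothetical and not taken from the generated page.

```python
def breathing_schedule(pattern=(4, 7, 8), cycles=4):
    """Return (phase, start_second, end_second) tuples for inhale/hold/exhale cycles."""
    names = ("inhale", "hold", "exhale")
    schedule, clock = [], 0
    for _ in range(cycles):
        for name, seconds in zip(names, pattern):
            schedule.append((name, clock, clock + seconds))
            clock += seconds
    return schedule


session = breathing_schedule()
total_seconds = session[-1][2]  # end time of the final exhale
```

A progress tracker could then compare elapsed time against `total_seconds`, or count completed exhale phases against the goal number of breaths.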

💡 Vision & Image Understanding

Photo Location Guessing:
- Task: Identify event name and location from a concert photo with minimal clues (no main stage, stripped metadata).
- Outcome: Correctly identified "Symphony at Sunset event" at "Sunset Beach Park."
- Impression: Scary accuracy; previous OpenAI models (o3, o4) were also good at this.

💡 Science & Research

Taxonomy Tree of Big Cat Species:
- Prompt: "Create a taxonomy tree of big cat species, displaying their classification from family to genus to species with hover over species descriptions."
- Outcome: Functional taxonomy tree (family Felidae, subfamilies, genus expansion).
- Features: Hover over descriptions appeared.
- Flaws: Pop-up didn't appear next to the hovered item.
- Impression: Information was accurate.
Interactive High School Physics Course:
- Prompt: "Create an interactive course on high school physics with visualizations and animations. Include just the first three lessons for now."
- Lessons:
  1. Motion and Kinematics: Interactive animation with adjustable initial velocity, acceleration, and displayed metrics.
  2. Forces and Newton's Laws: Object movement with adjustable static friction and metrics.
  3. Pendulum Dynamics: Visualizes potential and kinetic energies fluctuating, adjustable settings and simulation speed.
- Flaws: Labels for metrics jumbled together in lesson 2.
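The energy exchange visualized in lesson 3 follows from conservation of energy for an ideal pendulum. The sketch below is a minimal Python illustration under stated assumptions (a point mass, no friction or air resistance, release from rest); the function name and default values are hypothetical and unrelated to the generated course code.

```python
import math


def pendulum_energies(theta, theta_max, mass=1.0, length=1.0, g=9.81):
    """Potential and kinetic energy (J) at angle `theta` for release from rest at `theta_max`.

    Angles are in radians, measured from the vertical.
    """
    if abs(theta) > abs(theta_max):
        raise ValueError("theta cannot exceed the release angle")
    potential = mass * g * length * (1 - math.cos(theta))
    total = mass * g * length * (1 - math.cos(theta_max))
    kinetic = total - potential  # whatever is not potential is kinetic
    return potential, kinetic


release = math.radians(30)
for degrees in (30, 15, 0):
    pe, ke = pendulum_energies(math.radians(degrees), release)
    print(f"{degrees:>2} deg: PE={pe:.3f} J  KE={ke:.3f} J  total={pe + ke:.3f} J")
```

Potential energy peaks at the release angle and kinetic energy peaks at the bottom of the swing, while their sum stays constant.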
Business Intelligence Report (E-commerce Asia):
- Prompt: "Create a comprehensive business Intel report on e-commerce growth in Asia from 2020 to 2025." (Web search enabled).
- Outcome: Report with market size and growth, regional/country insights, growth drivers, market segmentation, challenges, summary table, key insights.
- Style: Very short and concise, dense information with citations.
- Comparison: Tends to produce shorter answers than GLM 4.5 or Kimi K2. Better for compact information.
Medical Research Report (Alexander Disease):
- Prompt: "The patient has Alexander disease... Research everything about the subject and suggest next steps or possible ideas for cures. Compile everything into a report with charts and graphs."
- Outcome: Report with definition, current clinical management, research and experimental therapies, summary table, suggested next steps.
- Style: Super short and dense, appropriate citations.
- Comparison: Shorter than GLM 4.5 or Kimi K2.
Sports Medicine Report (ACL Injury Rehab):
- Prompt: "25-year-old athlete with ACL injury, research, rehab protocols, and return to sport timelines. Suggest preventative training in a sports medicine report with recovery phase graphs." (Web search enabled).
- Outcome: Report broken into phases with estimated timeline, return to sport timing and risks, preventative training recommendations, rehab and prevention plan.
- Style: Again, super short.

💡 Image Generation

Storybook Generation:
- Prompt: "Generate a five-page story book about a frog who wants to be rich. Generate images for each page."
- Initial Attempt: Only gave one image, then another, not a full story.
- Agent Mode: Enabled agent mode to autonomously generate the storybook.
- Process: Generated images step-by-step, then ran Python code to convert to PDF.
- Outcome: PDF storybook with five pages.
- Flaws: Pages could be formatted more nicely, no cover page. Frog character not consistent across images.
- Image Model: Uses the 4o image model, no new image generator for GPT-5.

💡 Hallucination Test

Stable Diffusion 5 (Non-existent):
- Prompt: "Give me all the details about stable diffusion 5."
- Outcome: Correctly stated no official release or announcement for SD5, confirmed SD 3.5 as latest.
- Impression: Pass – did not hallucinate.

🔑 Usage & Accessibility

Availability: Should be out on ChatGPT for everyone, including free users.
Free Plan: Limited number of uses per day, then falls back to less intelligent model.
Paid Plan: Can explicitly select GPT-5.
Observation: Free-plan responses seemed worse than paid-plan responses, which suggests free users may get a smaller variant (mini or nano). All video tests used the paid plan.

🔑 Technical Specs & Benchmarks

Definition: OpenAI calls GPT-5 an AI system, not just a large language model.
- Smart router: Combines several internal models, automatically decides which model to use based on prompt.
- Continuous training: Router improves over time based on user feedback (model switching, preference rates, correctness).
- Black box: Proprietary and closed source.
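Because the router is proprietary, its actual logic is unknown. Purely to illustrate the general idea of prompt-based routing, here is a toy Python sketch. The model labels, cue words, and length threshold are all hypothetical, and nothing here reflects how OpenAI's router works.

```python
from dataclasses import dataclass


@dataclass
class Route:
    model: str   # hypothetical model label, not a real API identifier
    reason: str


# Hypothetical cue words that suggest a prompt needs deeper reasoning.
REASONING_CUES = ("prove", "debug", "step by step", "derive", "optimize")


def route_prompt(prompt: str, long_prompt_chars: int = 400) -> Route:
    """Pick a model tier from simple surface features of the prompt."""
    text = prompt.lower()
    matched = [cue for cue in REASONING_CUES if cue in text]
    if matched:
        return Route("thinking-model", f"matched cue: {matched[0]}")
    if len(prompt) > long_prompt_chars:
        return Route("thinking-model", "long prompt")
    return Route("fast-model", "short prompt without reasoning cues")


print(route_prompt("Reply to this email politely"))
print(route_prompt("Debug this segfault in my C code"))
```

A production router would presumably be a learned model trained on signals like the ones listed above (model switching, preference rates, correctness) rather than a keyword list.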
Key Improvements: Significantly reduced hallucinations, particularly good in writing, coding, and health.
OpenAI Reported Benchmarks (Internal Comparison):
- AIME (Competitive Math):
  - GPT-5 Pro (with thinking, Python usage): 100%.
  - Comparison: Grok 4 Heavy also achieved 100%, which suggests this benchmark is easy to max out.
- FrontierMath: GPT-5 with thinking outperforms ChatGPT agent. (Possible cherry-picking: o3 high is not shown.)
- GPQA Diamond (Graduate Science): GPT-5 on average beats o3 by a small margin.
- Humanity's Last Exam (Obscure Science):
  - GPT-5 (no thinking, no tools): 6.3% (pretty bad).
  - GPT-5 (with thinking): Higher.
  - GPT-5 Pro (Python, search): 42%.
  - Comparison: Grok 4 Heavy (Python, internet): 44% (higher, so GPT-5 is not state of the art here).
- SWE-bench Verified (Software Engineering):
  - GPT-5 (with thinking): 74.9% (currently the best score among AI models).
  - Comparison: Claude Opus 4.1 (latest version): 74.5%.
  - Impression: Slightly better at coding, consistent with testing experience.
- Minimal Improvements: Generally minimal improvements (e.g., ~2% better than 4o without thinking) over previous OpenAI models for many benchmarks.
- HealthBench Hard (Challenging Health Questions):
  - GPT-5 (non-thinking): 25% (vs. 4o at 0%).
  - GPT-5 (with thinking): 46% (way better than o3).
- Hallucination Rate (Health Questions):
  - GPT-5 (with thinking): 1.6% (lowest).
  - Concern: o3 and 4o rates are worryingly high (e.g., 15%).
Independent Leaderboards (External Comparison):
- LM Arena (Blind Test): Scores number one across all categories.
  - Categories: hard prompts, coding, math, creative writing, instruction following, longer query, multi-turn.
  - Impression: Pretty impressive.
- LiveBench by Abacus AI: Ranked number one, slightly beating o3 Pro high.
- Artificial Analysis: Ranked number one (high version), one point above Grok 4.
- Creative Writing Benchmark: Ranked number one, slightly ahead of Kimi K2 and Claude Opus 4.1. Best quality for stories/novels.
- Confabulations (Hallucination Rate):
  - Lower value is better (less hallucination).
  - GPT-5: Ranked number one (lowest hallucination rate).
  - Comparison: Better than GLM 4.5, Qwen 3, Gemini 2.5 Pro.
  - Verification: Confirms OpenAI's claim that GPT-5 is significantly less likely to hallucinate.
Pricing:
- GPT-5 high: $3.40 per 1 million tokens.
- Comparison: Same as Gemini 2.5 Pro, way cheaper than Grok 4.
- Conclusion: Pretty damn good in terms of both intelligence and cost effectiveness.
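For a rough sense of what that price means in practice, here is a minimal Python sketch. It treats the quoted figure as a single flat rate per million tokens, which is an assumption: actual API billing prices input and output tokens separately. The function name is hypothetical.

```python
def estimated_cost_usd(tokens: int, price_per_million_usd: float = 3.4) -> float:
    """Cost of `tokens` tokens at a flat price per one million tokens."""
    if tokens < 0:
        raise ValueError("tokens must be non-negative")
    return tokens / 1_000_000 * price_per_million_usd


# Example: a workload of 250,000 tokens at the quoted rate.
cost = estimated_cost_usd(250_000)
print(f"${cost:.2f}")
```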

🔑 Conclusion & Future Outlook

Overall Impression: Really good at coding, less error-prone than other top models.
Progress Pattern:
- Gemini 2.5 Pro was world's most powerful a few months ago.
- Grok 4 then became world's best a few weeks later.
- Now OpenAI's GPT-5 is the world's best model, ranking #1 across leaderboards.
Rate of Progress: Absolutely insane, new models in weeks instead of months.
Excitement: Exciting times to be alive.
Actionable Items:
- Subscribe to free weekly newsletter for AI updates.
- Like, share, subscribe for more content.

