AI Adventures: My Experience with OpenAI, DeepSeek, and More
Introduction
I’m excited to share my second deep dive into AI (Artificial Intelligence). As a software developer who loves creating new things, I’ve found AI to be the perfect companion for boosting productivity and fueling creativity. Sometimes it feels almost miraculous how quickly AI can help me build, test, and refine new ideas.
(By the way, I currently have a GPT Pro account that costs around €200 per month. I know… it sounds expensive! But so far, it’s been worth every penny… or so I keep telling myself!)
Benchmarking the Big Players
If you’ve spent any time online, you’ve likely seen AI model comparisons floating around. They’re fascinating because each model’s capabilities and limitations really stand out. From GPT-4o to Claude 3.5 Sonnet, everyone wants to know: Who’s on top?
Recently, a matchup between two distinct models caught my attention:
- OpenAI’s o3-mini (high variant)
- DeepSeek R1 (from DeepSeek AI in China)
A Notable Benchmarking Test
Dan Hendrycks shared a benchmark comparing several major AI models:
- OpenAI’s o3-mini
- OpenAI’s o1
- DeepSeek-R1
- GPT-4o
- Claude 3.5 Sonnet
- Grok-2
- Gemini Thinking
This test focused on accuracy and calibration error on Humanity’s Last Exam, a benchmark designed to push AI models to their limits in reasoning-intensive domains like math, physics, medicine, and ecology.
Key Findings:
- GPT-4o and Grok-2 had lower accuracy scores (3.3% and 3.8%, respectively).
- Claude 3.5 Sonnet did slightly better at 4.3%, while Gemini Thinking hit 7.7%.
- DeepSeek-R1 and OpenAI’s o1 performed competitively (9.4% and 9.1%).
- OpenAI’s o3-mini (high variant) led the pack with an accuracy of 13.0%.
This was a big wake-up call for many in the AI community, showing how some “lighter” OpenAI models can still outperform bigger names.
Sam Altman’s Response on X
OpenAI CEO Sam Altman chimed in on Dan Hendrycks’ post, writing:
“We will need another exam soon…”
It’s a short but intriguing comment, hinting that OpenAI might already be planning more advanced, real-world tests for future AI models.
A Few Days Later: An Updated Benchmark Emerges
Just days after the initial benchmark, Dan Hendrycks released a follow-up test featuring OpenAI’s Deep Research model, which integrates advanced browsing and Python tool capabilities.
New Results:
- OpenAI Deep Research model scored an impressive 26.6%, eclipsing all previous models.
- o3-mini (high) stayed at 13.0% accuracy, but Deep Research leveraged external tools to significantly boost its performance on complex reasoning tasks.
This suggests that the future of AI benchmarking might focus not just on “base models,” but also on how well they integrate external resources.
Putting AI Models to the Test
I’ve looked into various benchmarks and run my own experiments to see how models handle real-world challenges. Below are a few notable examples from my coding, debugging, and logic tests. Some are my own experiments, while others come from well-known AI benchmarks.
Example 1: Bouncing Ball in a Spinning Hexagon
I asked both o3-mini and DeepSeek R1 to:
“Write a Python program that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, and it must bounce off the rotating walls realistically.”
- o3-mini: Handled it well, with accurate collisions and gravity.
- DeepSeek R1: Struggled; the generated code didn’t reflect the dynamics of a rotating hexagon.
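To give a sense of what this prompt actually demands, here is a minimal sketch of the kind of simulation involved. This is my own illustration, not either model’s output, and it assumes pygame 2 is installed; the constants for gravity, restitution, friction, and spin are arbitrary. The key detail is that each bounce is computed relative to the velocity of the moving wall, which is what separates a spinning hexagon from a static one.

```python
# bouncing_hexagon.py - a minimal sketch, assuming pygame 2 is available
import math
import pygame

WIDTH, HEIGHT = 800, 800
CENTER = pygame.Vector2(WIDTH / 2, HEIGHT / 2)
HEX_RADIUS = 300        # distance from the hexagon centre to each vertex (px)
BALL_RADIUS = 15
GRAVITY = 900.0         # downward acceleration in px/s^2
RESTITUTION = 0.85      # how bouncy the walls are
FRICTION = 0.15         # fraction of tangential speed lost per bounce
OMEGA = 0.8             # hexagon angular velocity in rad/s

def hexagon_vertices(angle):
    """Return the six vertices of the hexagon rotated by `angle` radians."""
    return [CENTER + pygame.Vector2(math.cos(angle + i * math.pi / 3),
                                    math.sin(angle + i * math.pi / 3)) * HEX_RADIUS
            for i in range(6)]

def bounce_off_wall(pos, vel, a, b):
    """Resolve a collision between the ball and the wall segment (a, b)."""
    edge = b - a
    normal = pygame.Vector2(edge.y, -edge.x).normalize()
    if normal.dot(CENTER - a) < 0:        # make sure the normal points inward
        normal = -normal
    dist = normal.dot(pos - a)            # signed distance of the ball centre from the wall
    if dist < BALL_RADIUS:
        r = pos - CENTER
        wall_vel = pygame.Vector2(-r.y, r.x) * OMEGA   # wall speed at the contact point
        rel = vel - wall_vel                           # ball velocity relative to the wall
        vn = rel.dot(normal)
        if vn < 0:                                     # only react if moving into the wall
            rel -= normal * ((1 + RESTITUTION) * vn)   # reflect the normal component
            tangent = pygame.Vector2(-normal.y, normal.x)
            rel -= tangent * (FRICTION * rel.dot(tangent))   # rub off tangential speed
            vel = rel + wall_vel
        pos += normal * (BALL_RADIUS - dist)           # push the ball back inside
    return pos, vel

def main():
    pygame.init()
    screen = pygame.display.set_mode((WIDTH, HEIGHT))
    clock = pygame.time.Clock()
    pos = pygame.Vector2(CENTER.x, CENTER.y - 100)
    vel = pygame.Vector2(120, 0)
    angle = 0.0
    running = True
    while running:
        dt = clock.tick(60) / 1000.0
        for event in pygame.event.get():
            if event.type == pygame.QUIT:
                running = False
        angle += OMEGA * dt
        vel.y += GRAVITY * dt
        pos += vel * dt
        verts = hexagon_vertices(angle)
        for i in range(6):
            pos, vel = bounce_off_wall(pos, vel, verts[i], verts[(i + 1) % 6])
        screen.fill((20, 20, 30))
        pygame.draw.polygon(screen, (200, 200, 220), verts, width=3)
        pygame.draw.circle(screen, (240, 120, 80), pos, BALL_RADIUS)
        pygame.display.flip()
    pygame.quit()

if __name__ == "__main__":
    main()
```

Running the file opens a window with the spinning hexagon and the bouncing ball; tweak the constants to taste.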
Example 2: Steganography With a Twist
I have a Steganography project on GitHub (secret agent stuff—really fun!). I introduced a bug in my code, then asked ChatGPT (o1 Pro) and DeepSeek R1 to find and fix it:
- ChatGPT o1 Pro: Found the bug and delivered a working fix in about 3 minutes.
- DeepSeek R1: Returned a “server is busy” message after 9 minutes, and later provided code that did not solve the issue.
This underscored how quickly ChatGPT can adapt to unfamiliar code and debug a complex problem.
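For anyone curious what this kind of code looks like under the hood, below is a stand-alone sketch of least-significant-bit (LSB) embedding, the classic steganography technique. To be clear, this is not the code from my repository; the byte-buffer carrier, function names, and length-prefix framing are illustrative choices, and a real image-based version would read and write pixel data with a library such as Pillow.

```python
# lsb_sketch.py - illustrative LSB steganography on a raw byte buffer (not my repo's code)
def embed(pixels: bytearray, message: bytes) -> bytearray:
    """Hide `message` in the least-significant bit of each byte in `pixels`."""
    payload = len(message).to_bytes(4, "big") + message   # 4-byte length prefix, then data
    bits = [(byte >> i) & 1 for byte in payload for i in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("carrier too small for this message")
    out = bytearray(pixels)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & 0xFE) | bit      # clear the LSB, then set it to the payload bit
    return out

def extract(pixels: bytearray) -> bytes:
    """Recover a message embedded by `embed`."""
    def read_bytes(start_bit, count):
        value = bytearray()
        for b in range(count):
            byte = 0
            for i in range(8):
                byte = (byte << 1) | (pixels[start_bit + b * 8 + i] & 1)
            value.append(byte)
        return bytes(value)

    length = int.from_bytes(read_bytes(0, 4), "big")
    return read_bytes(32, length)

if __name__ == "__main__":
    carrier = bytearray(range(256)) * 8     # stand-in for raw image pixel data
    stego = embed(carrier, b"meet at dawn")
    print(extract(stego))                   # b'meet at dawn'
```

Bit bookkeeping like this is exactly the kind of place where a small, deliberate bug is easy to hide and tedious to spot, which is what made it a good debugging test.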
Handling Typos and Natural Language Glitches
I also noticed that ChatGPT tends to handle my (sometimes) typo-filled or hurried questions quite gracefully. Even if I type something like “shcedkjsiimng” instead of “scheduling,” it usually figures out my intention from context. DeepSeek, on the other hand, often responds with generic boilerplate if the input is unclear.
A Broader Look: Letter-Dropping Physics
Another test involved o3-mini, DeepSeek R1, and Claude 3.5 Sonnet:
“Create a JavaScript animation of falling letters with realistic physics. The letters should:
- Appear randomly at the top of the screen with varying sizes
- Fall under Earth’s gravity (9.8 m/s²)
- Have collision detection based on their actual letter shapes
- Interact with other letters, ground, and screen boundaries
- Have density properties similar to water
- Dynamically adapt to screen size changes
- Display on a dark background”
- Claude 3.5 Sonnet and o3-mini produced workable code with realistic physics.
- DeepSeek R1 struggled to incorporate collision detection and varying densities correctly.
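A big part of the difficulty in this prompt is unit handling: gravity is specified in metres per second squared while the screen is measured in pixels, and a water-like density only means something once each glyph is given a physical size. The tiny sketch below shows one way to wire those pieces together. It is in Python rather than JavaScript simply to keep the code samples in this post in one language, it is not what any of the three models produced, and the pixels-per-metre scale, slab thickness, and restitution value are assumptions of mine.

```python
# falling_letter_units.py - back-of-the-envelope version of the physics in the prompt
GRAVITY = 9.8             # m/s^2, Earth gravity as requested
PIXELS_PER_METER = 1000   # assumed scale: 1 px = 1 mm on screen
WATER_DENSITY = 1000.0    # kg/m^3, "density similar to water"

class Letter:
    def __init__(self, char: str, height_px: float, y_px: float = 0.0):
        self.char = char
        self.y = y_px                      # vertical position in pixels (down is positive)
        self.vy = 0.0                      # vertical speed in px/s
        # Approximate the glyph as a square slab 1 cm thick to give it a mass.
        side_m = height_px / PIXELS_PER_METER
        self.mass = WATER_DENSITY * side_m * side_m * 0.01   # kg

    def step(self, dt: float, floor_px: float, restitution: float = 0.4) -> None:
        """Advance the letter by `dt` seconds under gravity, bouncing on the floor."""
        self.vy += GRAVITY * PIXELS_PER_METER * dt   # convert m/s^2 into px/s^2
        self.y += self.vy * dt
        if self.y > floor_px:                        # crude ground collision
            self.y = floor_px
            self.vy = -self.vy * restitution

if __name__ == "__main__":
    letter = Letter("A", height_px=48)
    for _ in range(120):                             # two simulated seconds at 60 steps/s
        letter.step(1 / 60, floor_px=600)
    print(f"{letter.char}: y={letter.y:.1f}px, vy={letter.vy:.1f}px/s, "
          f"mass={letter.mass * 1000:.1f}g")
```

Per-glyph collision shapes are a separate and much harder problem, and that is where DeepSeek R1 fell short in this test.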
Final Thoughts
AI technology moves so fast that each new model can seem like a revelation. DeepSeek R1 does handle casual tasks reasonably well, and it may improve in the future. However, in my personal experience, OpenAI’s models feel more polished and dependable for coding and complex problem-solving.
I’ll keep sharing updates and personal experiments with GPT-based models (and others!) in upcoming posts. In the meantime, I’d love to hear your thoughts: Which AI tools have you tried, and how have they worked out for you?
Thanks for reading, and stay tuned for more AI experiments and adventures!