Four Beautiful Legs
There are going to be three sections in this article: a deep dive on the tools I've tested and what I think so far, a short summary of the deep dive's key takeaways, and a how-to for copying my exact setup. If you don't want to put up with me waxing poetic about metaphorical horse legs for the next 1000 or so words and just want my guide to cloning my AI coding setup, skip to Steal My Setup.
We're going to really drive the centaur metaphor into the ground. It's only post two, and I'm personally already tired of it. But I love when they say the name of the movie in the movie, and this blog won't be any different. Today we're talking about the horse half of our centaur ("horse half" in this case meaning the six NVIDIA A100s running full tilt somewhere in Kentucky so that I don't have to write my for loops correctly).
Deep Dive
The term "slop" gets thrown around a lot in discourse around LLM-generated content. Generally, it's shorthand for content generated badly using LLMs that people publish anyway. There are gradations to slop, ranging from allegations of huge data breaches resulting from AI-generated code to certain AI company CEOs forcing you to look at their AI-generated Ghibli avatars.
But I'd argue the slop discourse stems, in no small part, from a skill issue. This is particularly true as we've distanced ourselves from the GPT-3.5 era and moved into more advanced models. I'd say I consistently get better code and writing out of working with LLMs than I would have written myself. My code tends to be a sort of cinderblock house. It's functional, but you're sorely missing the window trimmings. Claude's includes input validation, proper error messages, and docstrings I'd never write.
Getting to the point where I can say that, though, took work. I've done a lot of experimenting. To find my legs, I've tested (in no particular order): Cursor, Void, VS Code with Copilot, Gemini CLI, Firebase Studio, Cline, Bolt, Codex, Replit, and Claude Code. I've tested Claude Code with SuperClaude, Claude-Flow, and BMAD. I'm in the process of testing Kiro and don't have a full opinion on that yet, but maybe I'll update y'all later, who knows. In Cline, I've at least lightly tested DeepSeek-R1, Gemini 2.5 Pro, Kimi K2, Qwen3-Coder, GPT-5, Claude 4.0 Sonnet (with and without the upgraded context length), and Claude 4.1 Opus. I've also tested local models using Ollama and LM Studio, ranging from Gemma 3 to gpt-oss-20b, with Qwen3 Coder 30b as the largest.
You'll notice I've not tested Grok; I appreciate you noticing.
Broadly, my experience is always best using Claude. If benchmarks are to be believed, this is a combination of the core model being very performant and a first-mover advantage that has resulted in a robust suite of supporting frameworks.
Anthropic created the Model Context Protocol (MCP), which effectively lets you bolt bespoke upgrades and skills onto your model. There's already a huge number of these toolsets, and they really run the gamut from letting Claude log in to AWS (shiver) to giving it real-time access to WWII facts (I will not use this as an opportunity to take a shot at Grok, I'm far too mature).
This customizability is the foundation of our four beautiful legs. When we think about where LLMs fail, the largest immediate problem is hallucination. At their core, everything from broken code to invented sources is the result of LLMs incorrectly recalling training data. One way to clean this up is simply to inject fresh, correct information on demand as the LLM operates (there are a few ways to handle this; RAG comes to mind and is worth reading about). Another is to create specific tools that enforce correct behavior when the LLM chooses to use them. MCP helps with both.
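To make the "inject fresh, correct information" idea a bit more concrete, here's a toy sketch in Python. Everything in it is made up for illustration: the DOCS list, the word-overlap scoring, and the function names are all my assumptions. Real RAG setups usually lean on embeddings and a vector store, and I'm not claiming this is how Claude or any MCP server does it under the hood.

```python
import re

# Hypothetical stand-in for a store of fresh, trusted notes.
DOCS = [
    "requests.get(url, timeout=10) raises requests.Timeout when the server is slow.",
    "pathlib.Path.read_text() returns the file contents as a string.",
    "json.loads() parses a JSON string into Python objects.",
]


def tokenize(text):
    """Lowercase the text and split it into a set of word-like tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def retrieve(question, docs, k=1):
    """Return the k docs sharing the most tokens with the question."""
    q_tokens = tokenize(question)
    ranked = sorted(docs, key=lambda d: len(q_tokens & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, docs, k=1):
    """Put the retrieved notes in front of the question before it reaches the model."""
    notes = "\n".join(f"- {d}" for d in retrieve(question, docs, k))
    return f"Use only these notes when answering:\n{notes}\n\nQuestion: {question}"


prompt = build_prompt("How do I parse a JSON string?", DOCS)
print(prompt)
```

The point is just the shape: look something up first, then hand the model the lookup result alongside the question, so it doesn't have to rely on whatever it half-remembers from training.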
To be honest, we're at the edges of my understanding here; I'm still trying to learn more about all of this. For example, when Claude uses an MCP to check data before it answers, I'm not sure if it is technically always using RAG or not ... I've tried to read up and figure it out but keep finding different answers (if you understand this/are an expert, please hit me up). This field is moving very fast, and these systems are rich and complicated.
But, mercifully, the existence of these systems has unleashed a tidal wave of benign internet nerds who build systems that insulate against some of the core problems with skillful LLM use. The SuperClaude framework is a great example of this. Are you bad at prompting? Do you just hate having to role-play (if I have to tell Claude that it is an expert in software engineering one more time I'll lose my mind)? Do you not know how to ask Claude to best structure your data analysis? Are you woefully unaware of what makes good UI? Me too! But with a framework like SuperClaude, my lazy and ham-fisted attempts to coax greatness from the machine get an instant upgrade every time I prompt.
Frameworks like this exist in many forms. BMAD comes to mind. And while I think BMAD is cool and has a huge amount of potential to be a well-built general-purpose framework, the secret sauce of SuperClaude's built-in MCPs is sorely missed. On the flip side are tools like Claude Flow, which have a huge host of MCP tools built in and extremely advanced features. I also think Claude Flow is pretty awesome, but it lacks the ease of use that SuperClaude gets from its built-in, under-the-hood expert picking and from skipping complications like running multiple agents at a time (another side note on this: Claude Flow is a token incinerator, goodness gracious).
Now, Claude isn't the only agent that can use MCP and has frameworks at this point. But I think this is first-mover advantage incarnate. I'm keeping an eye on how the competition catches up, but as long as Kimi K2 keeps swinging and missing at half the tool calls it attempts in Cline, I'll stick with Claude.
Ok, that was a lot. Let's summarize!
Bottom Line
The results of all this testing can be summarized with the following:
- For almost everything you want to do, Claude 4.0 Sonnet is going to be the best option. For long form writing in particular, Claude 4.1 Opus is a nice partner. For research, Gemini 2.5 Pro is the standout.
- ChatGPT won't suck for most uses and is generally going to give you a fine experience.
- The open-source models primarily coming from research groups in China (Kimi, DeepSeek, Qwen, etc.) are both an immense gift and broadly overhyped right now. What they possess in benchmarks, affordability, and openness they lack in feature support and consistency.
- Local models are a novelty with very limited use right now but are improving rapidly.
The setup I've settled into is Cline for model testing and Claude Code for almost everything else.
So, how do I use Claude Code? Steal My Setup here!
Housekeeping if you're still here
I don't really have a defined schedule I'm planning to release this stuff on. The next post will probably cover how I've been thinking about not letting my brain rot while handing off cognitive load to LLMs and AI systems. Something something image abstract thinking. Sign up below so you can get that when I get around to posting it in six months!
Also, I write this blog to learn as much as I write it to document my thoughts and (hopefully) teach. I'm by no means an expert. If you see anything in here that is wrong (or even that could just be improved in accuracy), please don't hesitate to reach out!!