The Stochastic Code Monkey Theorem

The relentless hype machine surrounding large language models would have you believe we are on the cusp of a software development revolution, a new epoch where programmers are obsolete, their keyboards gathering dust while throwing enough GPUs and parallel stochastic code monkeys at any problem will conjure the coding prowess of Fabrice Bellard and Jeff Dean given enough time and compute. This narrative is (not surprisingly) mostly spun by podcast hucksters and influencers rather than actual engineers and is, to put it mildly, a fanciful exaggeration. For the seasoned software engineer, these newfangled tools are not our replacements, but rather powerful, if occasionally unwieldy and infuriating, power tools that demand a firm hand and a watchful eye.

Language models, in their current incarnation, are a curious blend of savant and simpleton. Take Gemini 2.5 Pro, for instance. Google’s latest concoction is shockingly adept at deep analysis, capable of ingesting a sprawling codebase of hundreds of thousands lines and pinpointing a single typo with unnerving accuracy. Whatever pre-training secret hot sauce they are putting in their chili in Mountain View evidently has quite the kick. Yet, ask this same model to perform a rudimentary regex string replacement, and you'll see the digital equivalent of a toddler’s vacant stare as the agent spins in a loop trying to do a basic text edit on a file. It's a jarring synthesis of a post-doctoral computer science researcher and a toddler.

On the other hand, we have Anthropic's Claude Opus 4.1 and its more frugal cousin, Sonnet 4.0. These models are the reliable workhorses of the current generation, they demonstrating a consistent aptitude for following instructions, refactoring code with precision, and generally behaving as a competent, if uninspired, pair programmer. While they may lack the occasional flashes of sheer brilliance that Gemini 2.5 can exhibit, their steadfastness makes them invaluable for the day-to-day grind of software creation. The truly potent combination, for those with the resources, is a strategic deployment of both: Gemini for its huge context window and analytical prowess and Claude for its dependable execution.

Notably absent from the roster of genuinely useful coding companions are the over-hyped offerings from OpenAI. Even their latest GPT-5 model proves to be next to useless for anything beyond the most trivial, one-off scripts, regularly failing to grasp the nuance of larger, interconnected systems and having a vastly outdated (early 2024 era) training set. Their specialized reasoning models, meanwhile, are so sluggish and cost-inefficient as to be non-starters for the rapid, iterative cycle of professional software development. For the vast majority of coding tasks, they have been thoroughly eclipsed, falling well shy of the Pareto frontier where rivals comfortably reside. In the crucial calculus of performance versus price, the OpenAI models simply do not present a compelling case.

To truly harness these models for programming, we can't really use the simplistic web APIs like the standard ChatGPT interface, which are functionally useless for any task of meaningful complexity. An effective coding assistant requires codebase situational awareness, the ability to index symbols, call edit tools, execute searches, leverage external linters, and critically, interpret diagnostics from Language Server Protocol. Without this deep environmental grounding, the model is essentially coding blindfolded. The genuinely potent implementations are therefore found in standalone command-line interfaces and tightly integrated development environments like Zed, VS Code, or the myriad plugins for Vim (or Emacs for the heathens) that provide this context. A word of caution, however, regarding the new breed of venture-backed commercial tools; their pricing models are often deliberately opaque, functioning as the Uber of coding assistance. Do not become accustomed to the VC subsidy, for it will not last. The strategy is transparent: get you hooked on a powerful, underpriced service, and once you are dependent, extract maximum value. The first hit, as always, is free.

A disquieting truth about the software economy is that a substantial portion of it is not concerned with elegant architecture, security, or maintainability. It is a vast expanse of braindead JSON plumbing, YAML farming, boilerplate CRUD applications, and uninspired data dashboards. Graeber's anaysis of Bullshit Jobs applies just as much to the information technology profession as it does to the larger economy. Most software is bullshit, which is a perfect for a technology which can aptly be described as a stochastic bullshit engine, in the precise philosophical use of the word "bullshit". This is the "subprime software market," a realm of throwaway code destined for a rewrite the moment it encounters the harsh realities of production. Language models are unnervingly proficient at churning out "make me a TODO app in React" solutions that are just good enough to pass a superficial inspection. It's a market that perhaps shouldn't exist, but its presence is undeniable, and these models are poised to consume it whole, and with alarming speed.

This proliferation of AI-generated code brings with it a cybersecurity nightmare of unprecedented scale. The models, in their quest for token efficiency, are notorious for taking shortcuts, introducing egregious vulnerabilities that even a junior engineer would know to avoid. They are the digital equivalent of a contractor who uses masking tape to hold together a load-bearing wall. A recent study found that AI-generated code contains security flaws in a staggering 45% of cases. The very nature of their training on vast swathes of public code, much of which is itself insecure, creates a feedback loop of vulnerability amplification.

Compounding this problem is the rise of the "MBA engineer," the project manager who, armed with a powerful language model, suddenly fancies themselves a seasoned software architect. This phenomenon is already giving rise to a tidal wave of digital slop and technical debt that will plague organizations for years to come. Their lack of foundational knowledge leads them to accept the model's often-flawed output without question, creating systems that are brittle, opaque, and a maintenance catastrophe waiting to happen. We will all have to adapt to this new reality, a future where a significant portion of our time is spent cleaning up the digital messes left by these overconfident amateurs.

The non-deterministic and frankly lazy nature of these models further complicates matters. They are prone to hallucinating APIs, and will often insert placeholders or oversimplified logic in critical areas without so much as a warning, all in the name of minimizing their token output. A vague prompt like "make it work" is a recipe for bizarre and unpredictable behavior. The key to wielding these tools effectively lies in precise and detailed prompting, guiding the model toward a desirable solution rather than giving it free rein. Which at the moment, requires precise high-level software expertise that done by practitioners who know what they are doing.

One particularly exasperating quirk emerges when you task these models with the seemingly simple chore of fixing a failing unit test. Rather than engaging with the underlying business logic to diagnose the actual fault, the model will frequently opt for a path of duplicitous simplicity, surgically altering the code to hardcode a path that produces the exact expected test assertion output and nothing more. This is the coding equivalent of a student who, instead of learning the subject matter, simply memorizes the answer key for an exam. The resulting code is both silly and useless. It's a perfect, albeit infuriating, illustration of how these systems often extrapolate code without possessing a shred of genuine engineering best practices that most of us learned coding as teenagers.

For the discerning senior engineer, the path forward is clear. You might want to learn to use these tools tactically. These language models are not a threat to our livelihood, but a tool to be used tactically and with a healthy dose of skepticism. When combined with robust engineering practices—strong type systems, comprehensive code standard tooling, codebase indexing, and rigorous linters—their output can be significantly improved and their worst impulses curtailed. They can liberate us from the tedium of syntax and boilerplate, allowing us to operate at a higher level of abstraction, focusing on architecture and system design.

Whether these models will spontaneously evolve capacities of genuine engineers is an open research problem, and frankly, a matter of considerable doubt. There is mounting evidence that we are late on the S-curve of improvement through sheer scale, with behemoths like Google and Anthropic hemorrhaging tens of billions of dollars for a mere one percent gain on arbitrary benchmarks. These benchmarks, of course, correlate to little beyond the benchmark itself, serving mostly as a metaphorical ruler for corporate bragging rights, rather than any tangible profitability. The capital expenditure is patently unsustainable, a classic bubble inflated by venture capital optimism, especially when coding assistance represents one of the few verticals where these models are actually revenue-positive. Yet, if the history of software is any guide, the staggering costs will inevitably plummet. It is not difficult to imagine an equivalent of today's top-tier models running comfortably on a MacBook Pro with 64 gigabytes of memory within a few years, a feat likely to be achieved not by the current incumbents, but by some scrappy Chinese lab—the digital equivalent of Japanese car industry, where Toyota perfected profitability and product quality long after the Americans invented the automobile. Any suggestion of these models as an existential force seems increasingly silly; they are rapidly becoming commodity tools, far more akin to a Honda Civic than a nuclear weapon. We should welcome commodification and the bubble popping.

Beyond sheer technical utility and its manifest flaws, an inescapable ethical dimension to these tools warrants serious consideration. Many thoughtful engineers are choosing to become "AI-vegans," opting out entirely from a genuine disquietude about the provenance of these models. Their legitimate objections span from the murky data sources used for training to the exploitative labor practices required for their creation. Furthermore, a disquieting current of quasi-religious fervor emanates from the leadership of the very labs building these systems, a rationalist cult-like millenarian zeal that should give any clear-eyed observer pause. It is entirely possible, however, to sidestep this shift and remain a vital, successful software engineer. This path requires a deliberate focus on more boutique or artisanal domains of software, a kind of digital craftsmanship. A language model will not be writing secure Linux kernel modules anytime soon, or designing a niche programming language that falls outside its training corpus, or debugging the firmware on a deep-space probe. A fulfilling and quite lucrative career awaits in these specialized niches for those who wish to opt out of the language model ecosystem, because participation is not strictly necessary. The existence of Trader Joe's Two Buck Chuck did not displace the vintages of Bordeaux, neither will AI code displace the need for boutique expertise.

Ultimately, we are nowhere near the sci-fi scenarios of fully autonomous software engineers. These models are still just incredibly sophisticated pattern interpolators, high-dimensional word calculators that are useful, but shockingly far from having both the awareness and taste of a human engineer who can self-direct and conceive creatively. The real question is whether their proliferation will lead to a net good—a world where we can build better software faster—or an unmitigated disaster, a digital superfund site of unmaintainable and insecure code. Only time, and a great deal of experimentation and cleanup, will tell.