Here's the thing about AI in 2026: the competition has gotten insane. Not just incrementally more intense, but fundamentally different in character. We're watching a leapfrogging game where the most recent model is almost always the best model, and the window of relevance keeps shrinking. Working on AI systems at NeoSapien, I've had a front-row seat to this acceleration. What follows is my attempt to make sense of it.
The DeepSeek moment in early 2025 changed everything. A Chinese company released an open-weight reasoning model with near state-of-the-art performance, allegedly trained with far less compute at a fraction of the cost. It wasn't just a technical achievement. It was a narrative shift. The idea that compute moats would protect incumbent labs? That story got a lot harder to tell.
The New Competition Landscape
Let's be precise about what's actually happening. No lab today holds technology that the other labs can't eventually get their hands on. Researchers rotate between labs. Ideas propagate through papers, through Twitter, through the inevitable leaks. The differentiating factor isn't knowledge. It's resources and execution speed.
What DeepSeek kicked off was a movement within China similar to the one ChatGPT kicked off in the US. Suddenly there are dozens of Chinese tech companies releasing very strong frontier open-weight models. DeepSeek isn't even the clear leader anymore: Zhipu AI's GLM models, MiniMax, Moonshot AI's Kimi, Alibaba's Qwen. The list keeps growing.
The Chinese models are open weights with genuinely unrestricted licenses. No user limits, no reporting requirements. Compare that to Llama or Gemma where there are strings attached if you exceed certain thresholds. This matters for enterprise adoption. It matters for building on top of these models.
The business models that American AI companies are running could be at risk. People in the US pay for AI software. Historically, people in China and much of the rest of the world haven't. If Chinese companies are giving away near-frontier capabilities for free, the subscription model faces real pressure.
Who's Actually Winning?
Different labs are betting on different things, and it's showing. Anthropic has bet hard on code. Claude Code is genuinely changing how developers work. The latest Claude Opus models have earned their reputation. Google's Gemini had its moment but somehow faded from the conversation. OpenAI keeps landing things despite appearing chaotic operationally.
Here's what I use personally: ChatGPT for quick information lookups, the fast model. Gemini for simple queries or things I could Google. Claude Opus for code and any sort of philosophical discussion, always with extended thinking. Grok 4 Heavy for hardcore debugging that other models can't solve. Each model has found a niche where it wins my usage.
The interesting observation: everyone I talk to who's doing serious work uses multiple models. The threshold effect is real: a model does something smart and you fall in love with it, then it does something dumb and you switch. You use it until it breaks.
The Architecture Question
Here's something that surprises people: from GPT-2 to today's frontier models, the fundamental architecture hasn't changed that much. It's still the transformer. Still autoregressive. Still attention mechanisms and feedforward layers.
The differences are in the details: Mixture of Experts (MoE) to pack more knowledge without proportional compute costs, Multi-head Latent Attention for more economical long-context handling, Group Query Attention, RMSNorm instead of LayerNorm, different position encodings. But you can literally start with GPT-2 code and modify it step by step to get to Llama 3 or DeepSeek V3. It's the same lineage.
Figure: The transformer lineage, from GPT-2 to today's frontier models.
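To make one of those details concrete: RMSNorm drops LayerNorm's mean-centering and bias term, and that's the whole change. A minimal PyTorch sketch (my own illustration, not any particular model's code):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x):
        # Normalize by the root-mean-square only: no mean subtraction, no bias term.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

x = torch.randn(2, 5, 16)
print(RMSNorm(16)(x).shape)  # torch.Size([2, 5, 16]): same shape, rescaled per position
```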
Mixture of Experts (MoE)
Instead of one fully connected feedforward network, you have multiple "experts" with a router that selects which experts to use for each token. You pack more knowledge into the network, but only a fraction of it is used for any given token. DeepSeek V3 has 256 routed experts per MoE layer but activates only 8 of them per token. This is why a 671B-parameter model (with roughly 37B parameters active per token) can run with the compute of a much smaller dense model.
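Here's a deliberately tiny sketch of that routing idea in PyTorch. The expert count, sizes, and top-k below are made up for illustration; real implementations add shared experts, load-balancing losses, and a mountain of systems work on top:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                                   # x: (n_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)           # routing probabilities
        weights, idx = probs.topk(self.top_k, dim=-1)       # pick top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                       # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

x = torch.randn(10, 64)
print(TinyMoE()(x).shape)   # only top_k of the 8 expert MLPs run for each token
```

The parameters all exist, but each token only pays for the experts its router picked: that's the whole trick.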
What Actually Changed
The gains aren't in architecture. They're in training methodology and systems engineering. FP8 and FP4 training, better data mixtures, more sophisticated parallelism strategies. Labs are figuring out how to utilize more compute efficiently, which lets them train faster and iterate on data and algorithms.
Think about it: tokens per second per GPU is a metric that defines large-scale training. You can go from 10K to 13K tokens/second by turning on FP8 training. That's a 30% speedup from a systems optimization, not an algorithmic breakthrough. These efficiency gains compound into faster experimentation, which compounds into better models.
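To see why that compounds, here's the back-of-envelope version using those tokens-per-second numbers. The token budget and cluster size are made up for illustration:

```python
# Back-of-envelope: what a 30% throughput bump does to a fixed-budget training run.
# Token budget and GPU count are hypothetical; tokens/sec figures are from the text above.
tokens_budget = 15e12                       # train on 15T tokens
gpus = 2048
for tok_per_sec_per_gpu in (10_000, 13_000):
    days = tokens_budget / (tok_per_sec_per_gpu * gpus) / 86_400
    print(f"{tok_per_sec_per_gpu:>6} tok/s/GPU -> {days:4.1f} days of training")
# ~8.5 days vs ~6.5 days: the saved time goes straight into more experiments.
```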
Scaling Laws: Dead or Alive?
The discourse around scaling laws has gotten confused. Let me try to clarify.
A scaling law is the power-law relationship between compute or data (the x-axis) and held-out next-token prediction loss (the y-axis). This relationship has held across roughly 13 orders of magnitude of compute. The technical question of whether putting in more pre-training compute produces better models: that's still working. The real question is whether it's economically sensible.
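For reference, the standard Chinchilla-style parametric form (from the published scaling-law literature, not any one lab's internal fit) writes held-out loss as power laws in parameter count N and training tokens D, with E the irreducible loss:

$$L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

More compute lets you push both N and D up, and the loss keeps falling along those power laws. Nothing in the equation tells you when that stops being worth the money; that's the economic question.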
There are now three axes of scaling:
- Pre-training scaling: Bigger models, more data. The traditional approach. Still works, just expensive.
- RL scaling: Longer reinforcement learning runs with verifiable rewards. The breakout of 2025.
- Inference-time scaling: Letting the model generate more tokens on a specific problem. Why o1 and R1 feel so different.
The low-hanging fruit has shifted. Pre-training a bigger model is expensive and the gains are getting harder to buy. But RL with verifiable rewards? The scaling curve there is still steep. OpenAI's o1 results showed evaluation scores rising roughly linearly with the log of RL compute. That's a juicy curve to climb.
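Put differently (my paraphrase of the shape of those published plots, not an exact fit): each doubling of RL compute buys roughly a constant number of benchmark points,

$$\text{score} \;\approx\; a + b \log C_{\mathrm{RL}}$$

which is exactly the kind of curve you keep feeding compute until it bends.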
Pre-training is a fixed cost: you spend the compute once and the capability is permanent. Inference-time scaling costs money per query. The math changes based on how many users you have, how long your model will be in market, and what your margins look like. ChatGPT with 100 million users has different economics than a specialized enterprise API.
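A toy breakeven sketch makes the tradeoff concrete. Every number below is a placeholder, not real pricing:

```python
# Toy comparison: amortized one-time pre-training cost per query vs. paying for
# extra "thinking" tokens on every query. All numbers are made-up placeholders.
pretrain_cost = 100e6              # one-time training spend ($)
lifetime_queries = 100e9           # queries served while the model is in market
extra_thinking_tokens = 2_000      # reasoning tokens added per query
serving_cost_per_m_tokens = 5.0    # $ per 1M generated tokens

amortized_pretrain = pretrain_cost / lifetime_queries
per_query_thinking = extra_thinking_tokens / 1e6 * serving_cost_per_m_tokens

print(f"pre-training, amortized: ${amortized_pretrain:.4f} per query")   # $0.0010
print(f"inference-time scaling:  ${per_query_thinking:.4f} per query")   # $0.0100
# High traffic shrinks the first number toward zero; the second is paid on every request.
```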
The Training Pipeline
Here's how modern LLM training actually works:
Pre-training
Classic next-token prediction on massive internet corpora. But the data has evolved. It's not just raw text anymore. There's synthetic data: rephrasing, Q&A reformatting, summarization. The intuition is that a model learning from well-structured text gets there faster than one parsing messy Reddit posts.
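Stripped of scale, the objective itself fits in a few lines: shift the tokens by one position and minimize cross-entropy. A sketch with a stand-in model, just to show the mechanics:

```python
# The pre-training objective in one function: predict token t+1 from tokens <= t.
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    # token_ids: (batch, seq_len) integers; inputs are shifted against targets by one.
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)                                  # (batch, seq_len - 1, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

# Stand-in "model": embedding + linear head, just to show the shapes line up.
vocab, d_model = 1000, 32
emb, head = torch.nn.Embedding(vocab, d_model), torch.nn.Linear(d_model, vocab)
model = lambda ids: head(emb(ids))
print(next_token_loss(model, torch.randint(0, vocab, (4, 128))))   # ~ln(1000) ≈ 6.9 at random init
```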
Pre-training datasets are measured in trillions of tokens. Smaller research models might use 5-10 trillion. Qwen reportedly went to 50 trillion. The rumored frontier is 100 trillion. Getting this data involves massive filtering pipelines, OCR systems for PDFs, and careful deduplication.
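Deduplication alone is a real engineering job at that scale. The simplest version is exact dedup by normalized content hash, which production pipelines extend with fuzzy (MinHash, suffix-array) matching. A toy sketch:

```python
# Exact deduplication by normalized content hash. Real pipelines add fuzzy dedup
# on top of this; the documents here are toy strings.
import hashlib

def dedup(docs):
    seen, kept = set(), []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.split()).lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

docs = ["The cat sat.", "the  cat sat.", "A different document."]
print(dedup(docs))   # the near-identical duplicate is dropped
```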
Mid-training
Same algorithm as pre-training, but on more specialized data. Long-context documents. Reasoning traces. Code. The idea is to give the model the skills it needs for post-training to work.
This is where contamination becomes a real issue. If your mid-training set includes problems that look like your evaluation benchmarks, you're not measuring what you think you're measuring. Qwen has faced scrutiny here. There are papers showing the base model produces suspiciously high-precision answers to math problems when only the numbers are changed.
Post-training: RLVR
Reinforcement Learning with Verifiable Rewards is the breakthrough paradigm. You give the model a problem with a known correct answer. You let it try many times. You reward it for getting the right answer. You don't constrain how it gets there.
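Here's the loop in miniature. The verifier, the dummy policy, and the update rule below are schematic stand-ins (a real policy is an LLM and the update is a GRPO- or PPO-style gradient step), but the shape of the algorithm is this:

```python
# RLVR in miniature: sample several attempts per problem, reward exact-match answers,
# and reinforce the above-average ones. DummyPolicy and update() are placeholders.
import random
import re

def verify(completion, gold_answer):
    # Verifiable reward: 1.0 if the last number in the completion matches, else 0.0.
    nums = re.findall(r"-?\d+\.?\d*", completion)
    return 1.0 if nums and nums[-1] == gold_answer else 0.0

def rlvr_step(policy, problem, gold_answer, n_samples=8):
    completions = [policy.sample(problem) for _ in range(n_samples)]
    rewards = [verify(c, gold_answer) for c in completions]
    baseline = sum(rewards) / len(rewards)              # group-mean baseline
    advantages = [r - baseline for r in rewards]
    policy.update(problem, completions, advantages)     # push up above-average attempts
    return baseline                                     # fraction correct this step

class DummyPolicy:
    def sample(self, problem):
        return f"Let me think... the answer is {random.choice(['41', '42', '43'])}"
    def update(self, problem, completions, advantages):
        pass                                            # real version: gradient step

print(rlvr_step(DummyPolicy(), "What is 6 times 7?", "42"))
```

Notice there's no instruction anywhere about how to reason. The only signal is whether the final answer checks out.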
Emergent Reasoning Behavior
In DeepSeek R1's training, something remarkable happened: the model started self-correcting. It would generate a solution, recognize a mistake, explicitly say "let me try again," and revise. This wasn't explicitly trained. It emerged from the simple reward signal of "did you get the right answer." The longer they trained, the longer the responses got. More tokens, more self-correction, higher accuracy.
I've run experiments with RLVR on small models. A Qwen 2.5 base model with 15% accuracy on MATH 500 can hit 50% accuracy in 50 training steps. Literally a few minutes of compute. The model isn't learning new math knowledge in 50 steps. The knowledge was already there from pre-training. RLVR is unlocking it.
Post-training: RLHF
Reinforcement Learning from Human Feedback is still the finishing touch. It makes models more useful: better organization, appropriate style, helpful tone. This is what made ChatGPT feel magical when it launched.
But RLHF has a ceiling. You're optimizing for aggregate human preferences, which means averaging across many people's opinions. There's no scaling law for RLHF. You can't just throw more compute at it and get proportionally better results. It's about matching style, not learning capabilities.
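Concretely, RLHF usually starts from a reward model trained on pairwise human comparisons, and that pairwise loss is exactly where the averaging of opinions happens. A minimal Bradley-Terry-style sketch (the scores here are placeholder numbers, not real model outputs):

```python
# Pairwise preference loss for a reward model: push the chosen response's score
# above the rejected one's.
import torch
import torch.nn.functional as F

def preference_loss(score_chosen, score_rejected):
    # Bradley-Terry style: -log sigmoid(r_chosen - r_rejected), averaged over pairs.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

chosen = torch.tensor([2.1, 0.3, 1.5])
rejected = torch.tensor([1.0, 0.8, -0.2])
print(preference_loss(chosen, rejected))
```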
Figure: The modern LLM training pipeline and the relative time/cost allocation across stages.
The Tool Use Problem
If there's one thing that will determine the near-term trajectory of AI impact, it's tool use. The ability for models to search the web, run code, call APIs, interact with software. This is where the unlock happens.
Right now, tool use is mostly on the proprietary LLM side. Claude Code is changing how developers work. ChatGPT does web searches. But for open-weight models, the ecosystem isn't quite ready. And the trust problem is real: do you want to give an LLM permission to execute commands on your computer? To touch your files?
The promise is huge. Instead of having the LLM memorize "what is 23 plus 5," just use a calculator. Instead of trying to remember who won the 1998 World Cup, do a search. Hallucination is largely a symptom of asking models to do things they should be delegating to tools.
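The mechanics of that delegation are simple enough to sketch: the model emits a tool call instead of an answer, the runtime executes it, and the result goes back into context. The CALL syntax, the tools, and the dummy model below are all invented for illustration; real stacks use each provider's function-calling schema:

```python
# Minimal tool-use loop: if the model asks for a tool, run it and feed the result back;
# otherwise return the answer. Everything here is an illustrative stand-in.
import re

TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy only
    "search": lambda q: "France won the 1998 World Cup",               # canned result
}

def run(model, user_message, max_steps=5):
    transcript = [("user", user_message)]
    for _ in range(max_steps):
        reply = model(transcript)
        call = re.match(r"CALL (\w+): (.+)", reply)
        if not call:
            return reply                                  # plain answer, we're done
        name, arg = call.groups()
        transcript.append(("tool", TOOLS[name](arg)))     # tool result goes back in context

class DummyModel:
    def __call__(self, transcript):
        if transcript[-1][0] == "user":
            return "CALL calculator: 23 + 5"              # delegate instead of recalling
        return f"The answer is {transcript[-1][1]}."

print(run(DummyModel(), "What is 23 plus 5?"))            # -> "The answer is 28."
```

The loop is trivial. The hard parts are trust, sandboxing, and a model that reliably knows when to delegate.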
Open models want to be useful for multiple tools and use cases. Closed models deeply integrate specific tools into the experience. This creates a gap: Claude can seamlessly search, run code, manage files. An open model downloaded from Hugging Face requires you to build all that infrastructure yourself.
Computer use, having an LLM operate a full graphical interface, has been demoed but it still sucks. Multiple labs showed demos in 2024 and they're all unreliable. This is harder than API-based tool use because you're dealing with arbitrary interfaces, visual grounding, long sequences of actions. The models aren't good at it yet, and the research to make them good is expensive.
The Coding Revolution
Recent surveys of professional developers show something striking: both junior and senior developers (10+ years of experience) are shipping AI-generated code. Not just playing with it. Shipping it. And senior developers are more likely to ship code that's majority AI-generated.
About 80% of developers find it either somewhat or significantly more enjoyable to use AI as part of their work.
This matches my experience. The AI isn't replacing the joy of programming. It's removing the mundane friction. The Bash script you need in 10 seconds because your wife is waiting in the car. The website tweak you don't enjoy. The boilerplate you've written a hundred times.
But there's a Goldilocks zone. If you use the LLM to do all your coding, the thing you love is no longer there. You become a manager of something that codes for you. Two years of that, eight hours a day: do you still feel fulfilled?
Code doesn't lie. It's math, basically. With a math textbook, you might not notice mistakes because you're not running the proofs. With code, you run it and see whether it does what you asked. There's far less room for misunderstanding.
The real question for education: how do you become an expert if you never struggle? The way I learned was by trying things myself, by debugging, by the joy of finally finding the bug. If LLMs are always there to shortcut the struggle, where does expertise come from?
Timeline to AGI (and Why It's the Wrong Question)
The discourse around AGI timelines is confused because nobody agrees on definitions. But here's a useful framing: the remote worker test. Can AI do most digital tasks that a remote worker could do? That's a concrete milestone.
Language models today are immensely powerful but they're not remote worker drop-ins. They can't learn from feedback the way an employee does. They don't update their weights based on your correction. The continual learning problem is unsolved.
Jagged Intelligence
Models are excellent at some things (writing code for websites, explaining concepts) and surprisingly bad at others (distributed ML systems, visual reasoning). The gap is real.
Software Engineer Assistants
Most software development becomes AI-assisted. Simple apps, websites, data analysis become fully automatable. Complex systems still need humans.
Specialized Breakthroughs
AlphaFold-style moments in specific domains. Scientific discovery accelerates. But general "drop-in remote worker" remains elusive.
Automated Research
AI that can genuinely do novel AI research. This requires solving taste, creativity, and long-horizon planning. Harder than it looks.
The thing that would actually matter, automated AI research leading to recursive self-improvement, requires new ideas. The current paradigm of pre-training plus RLVR plus inference scaling can go far, but at some point you need fundamentally different approaches. Value functions, process reward models, maybe architectures beyond transformers. We don't know when those breakthroughs come.
What Actually Matters
Let me end with what I think people are underweighting:
Making human knowledge accessible. The difference between Google Search and an LLM is bigger than people acknowledge. I can ask anything and get an answer. Understanding my own life, figuring out career trajectories, learning about human history. This isn't just in the US. It's kids throughout the world being able to learn. The long-term impact of that permeating everything is probably where the real transformation happens.
Physical goods and in-person experience. As AI slop floods the internet, there will be an increasing premium on the physical, on the real. Seeing each other. Talking in person. Artifacts that a human made. The slop is only starting. The next few years will be worse before society snaps out of it and the physical becomes precious again.
Individual suffering in transitions. When we talk about jobs being automated, we have to remember each lost job is a human being's suffering. That's real tragedy at the individual level. Economic arguments about new jobs being created don't help the person whose job just got eliminated. We can't lose sight of that as we build these systems.
Agency and community. These are what humans actually want. The ability to do meaningful things with people you care about. AI can augment this or it can hollow it out. Which path we take isn't determined by the technology. It's determined by choices we make about how to deploy it.
In 100 years, will people remember the transformer architecture? The scaling laws? Probably not the details. They'll remember computing, broadly. The connectivity. The democratization of knowledge. The question is whether they'll remember it as the thing that helped humanity flourish or the thing that made us forget what it meant to be human.
Building AI systems at NeoSapien, I think about this constantly. We're not trying to replace human cognition. We're trying to augment it. The thesis is simple: instead of replacing what makes humans valuable, enhance it. The attention mechanisms in neural networks can complement human attentional processes, not compete with them.
Whether that thesis is right, whether we can actually build systems that enhance rather than diminish human agency: that's the question that matters. The scaling curves and benchmark numbers are just means to an end.
What end are we building toward?