AI Field Guide · June 2026

The AI and ML Stack in 2026, and the Agent Road to 2027

A practical map of the techniques, frameworks, system layers, and agent infrastructure people are actually using now. Not a hype list. A stack with failure modes.

June 3, 2026 · 34 min read

Every AI stack diagram lies by omission.

The field is too wide now. "AI engineer" can mean tabular modeling, GPU kernel work, RAG pipelines, robotics control, eval harnesses, fine-tuning, model routing, agent observability, or one unlucky person maintaining all of it because the org chart was designed by vibes.

So here is the useful version.

As of June 2026, modern AI and ML is not one stack. It is seven interlocking layers: data and classical ML, foundation model architecture, training systems, post-training, retrieval and memory, agent orchestration, and inference operations.

The mistake is treating these as one job. They are not. They touch, but they fail differently.

The thesis

The frontier moved from "which model is smartest" to "which system can be trusted to act." The model still matters. The harness around the model now decides whether the thing survives users, permissions, latency, cost, and Tuesday.

Fig. 01 · The 2026 AI/ML stack: from data to system behavior (layers, top to bottom)

  • Product and workflow layer: UI, approvals, user intent, business rules, workflow constraints, human handoff
  • Agent orchestration layer: tool calling, workflow graphs, durable execution, MCP, A2A, sandboxes, identity
  • Retrieval, memory, and context layer: hybrid retrieval, reranking, graph memory, permissions, freshness, user state
  • Serving, inference, and operations layer: vLLM, SGLang, TensorRT-LLM, TGI, llama.cpp, Ollama, routing, caching, tracing
  • Adaptation and post-training layer: SFT, RLHF, DPO, GRPO, LoRA, QLoRA, distillation, synthetic data, eval loops
  • Model architecture and training layer: Transformers, MoE, FlashAttention, Mamba/SSMs, DiT, flow matching, VLMs, JEPA
  • Data, classical ML, and governance base: scikit-learn, XGBoost, LightGBM, feature stores, labels, data quality, access policy

The modern stack is not "LLM plus prompt." It is data, models, post-training, retrieval, agents, serving, and product orchestration welded together under latency, cost, quality, and permission constraints.

What the First Stack Diagrams Miss

Most diagrams still stop at "model plus app." That was fine when the product was a chat box. It is wrong when the product can search private data, call tools, edit files, trigger workflows, spend money, or ask another agent to do work.

The missing layer is control.

A useful AI system now has to answer seven boring questions before the model even gets interesting: what data can it see, what action can it take, who approved the action, what state persists, what happens if it fails, how do we replay the trace, and how do we know whether the answer was any good?

That is why the stack is getting more software-shaped. Frameworks are not just prompt wrappers anymore. The serious ones are becoming runtimes with persistence, tool schemas, human interrupts, sandboxing, evals, and observability.

Classical ML Is Not Dead

Start here because everyone keeps trying to bury it.

If your data is tabular, messy, medium-sized, structured, and tied to a business process, gradient boosted trees still eat deep learning for breakfast more often than people admit. Fraud models, churn models, pricing, demand forecasts, risk scoring, lead scoring, anomaly detection, ranking features, internal decision systems. You do not need a 70B parameter model to discover that invoices paid late tend to be paid late. You need clean features and a calibrated model.

The practical frameworks are boring in the good way: scikit-learn for standard ML pipelines, XGBoost and LightGBM for high-performance gradient boosting, plus Pandas, Polars, Spark, DuckDB, and whatever warehouse your data team has not managed to set on fire this quarter.

Classical ML techniques

  • Gradient boosted trees: still brutal on tabular data.
  • Feature engineering: less glamorous than models, more predictive than meetings.
  • Calibration: turns scores into probabilities humans can price.
  • Drift monitoring: catches reality changing under the model.
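
Calibration and drift are the two items on that list people skip, so here is a minimal sketch of how each can be checked with nothing but the standard library. The function names, the five-bin default, and the fixed bin edges are assumptions made for illustration; a real pipeline would more likely lean on scikit-learn or Evidently than hand-rolled loops.

```python
import math

def expected_calibration_error(probs, labels, n_bins=5):
    """Weighted average gap between mean predicted probability and observed rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 lands in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_p = sum(p for p, _ in bucket) / len(bucket)
        rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / len(probs)) * abs(mean_p - rate)
    return ece

def population_stability_index(expected, actual, edges, eps=1e-6):
    """PSI between a reference sample and a live sample over fixed bin edges."""
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v > e for e in edges)] += 1  # index = number of edges below v
        return [max(c / len(values), eps) for c in counts]
    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A low calibration error means a score of 0.9 really does behave like a 90 percent chance. A PSI near zero means the live feature still looks like the training feature; a large value is the signal that reality moved.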

Frameworks that matter

scikit-learn · XGBoost · LightGBM · Ray · MLflow · W&B · Evidently

The rule is simple: if the target is structured and the stakes are operational, try the boring baseline first. If the boring baseline wins, keep it. Your users will not mourn the absence of a transformer.

The Foundation Model Layer

Deep learning's center of gravity is still the transformer. That has not changed. What changed is the amount of engineering wrapped around the transformer to make it trainable, serveable, and less financially offensive.

The 2026 architecture menu looks like this.

Transformers and attention

Still the default for language and many multimodal systems. FlashAttention-style kernels matter because attention is not just math. It is memory traffic wearing a math costume.

Transformers · FlashAttention · long context · KV cache

Mixture of experts

MoE activates only part of the model per token. More parameters, similar active compute, more routing headaches. Very useful, very easy to make unstable.

MoE · routing · sparse compute
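
To make "activates only part of the model per token" concrete, here is a minimal sketch of top-k routing for a single token. It shows one common variant (softmax over the selected router logits only); production MoE layers add load-balancing losses, capacity limits, and batched tensor math that this toy version leaves out. The names `top_k_route` and `moe_forward` are made up for the example.

```python
import math

def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    top = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - top) for i in chosen]  # shift for stability
    z = sum(exps)
    return [(i, e / z) for i, e in zip(chosen, exps)]

def moe_forward(x, experts, router_logits, k=2):
    """Combine outputs of the selected experts; unselected experts are never called."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))
```

The compute saving comes from the last line: with 8 experts and k=2, six expert functions never run for this token. The instability comes from the router: if it keeps picking the same two experts, the rest of the parameters are expensive decoration.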

Mamba and state space models

SSMs attack the long-sequence cost of attention. They are not a clean transformer replacement everywhere, but they are real enough to keep in the map.

Mamba · SSM · linear time

Diffusion, DiT, and flow

Image, video, and audio generation shifted toward transformer-backed diffusion and flow matching. The U-Net did not vanish, but DiT and flow-based systems are now the serious scaling path.

Diffusion · DiT · flow matching · rectified flow

Then there are multimodal systems: CLIP-style contrastive encoders, vision-language models, speech-language models, vision-language-action models, and JEPA-style world models. This is where the field is becoming less "text model with image input" and more "general sequence and world-state modeling."

Language is still the interface. It is no longer the whole substrate.

Fig. 02 · Architecture families: modern model architecture is a set of tradeoffs

  • Transformer: best general-purpose bet; expensive attention
  • MoE: more capacity per FLOP; routing is the tax
  • SSM / Mamba: long sequence efficiency; attention is not free
  • Diffusion / Flow: generate images, audio, video; sample quality versus speed
  • World Models: predict consequences; still early, very important

Pick the architecture by bottleneck: data type, context length, compute, latency, action, and controllability.

The architecture choice is mostly a bottleneck choice. Transformers dominate, but MoE, SSMs, diffusion/flow, and world models all exist because some bottleneck got painful enough.

Training Frameworks: What People Actually Use

For training and research, the core frameworks are not mysterious: PyTorch, JAX, Keras, and the Hugging Face ecosystem.

This is not religious. Use PyTorch when you want the broadest ecosystem. Use JAX when transformations and large-scale research are the point. Use Keras when API cleanliness and backend flexibility matter. Use Hugging Face because reimplementing every model loader is not a personality trait.

The Framework Map Is Really a Failure-Mode Map

The wrong way to choose frameworks is by popularity. The useful way is by the failure mode you are trying to reduce.

If the risk is model quality

Use PyTorch, JAX, Hugging Face, Weights & Biases, MLflow, and strong eval suites. You are in training, fine-tuning, or experiment-management land.

If the risk is retrieval quality

Use LlamaIndex, LangChain, rerankers, hybrid search, metadata filters, graph retrieval, and source-quality evals. The vector database is only plumbing.

If the risk is agent control

Use LangGraph, OpenAI Agents SDK, Google ADK, Microsoft Agent Framework, Pydantic AI, CrewAI, Mastra, or Strands. Pick by language, state model, and operational constraints.

If the risk is inference cost

Use vLLM, SGLang, TensorRT-LLM, TGI, llama.cpp, Ollama, MLX, model routing, prompt trimming, caching, and smaller specialist models.

Notice the pattern: the framework is never the point. It is the shape of the failure you are willing to own.

Post-Training Is Where Models Become Products

Pretraining teaches a model the rough structure of the world. Post-training teaches it how not to behave like a stochastic autocomplete raised in a cave.

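The modern post-training stack is layered: supervised fine-tuning (SFT) for task format, preference optimization (RLHF or DPO), reasoning RL (GRPO on verifiable tasks), cheap adaptation with LoRA or QLoRA, distillation into smaller models, and quantization for cheaper serving.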

This is why "which base model should we use?" is an incomplete question. A mediocre base model with good data, evals, retrieval, and post-training can beat a better base model wrapped in garbage.

The model is not the product. The behavior envelope is the product.

Fig. 03 · The post-training pipeline: model training does not end at pretraining (stages, in order)

  • Base model: pretrained
  • SFT: task format
  • Preference: RLHF / DPO
  • Reasoning RL: GRPO / tasks
  • LoRA / QLoRA: cheap adaptation
  • Distillation: smaller model
  • Quantization: serve cheaper

The loop is driven by evals. Without evals, post-training is astrology with GPUs.

Post-training is not one method. It is a behavior-shaping loop. The eval harness is the steering wheel.
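
For a feel of what the preference stage actually optimizes, here is a sketch of the DPO loss for a single preference pair. It assumes you already have summed sequence log-probabilities from the policy being trained and from a frozen reference model; the argument names and the `beta=0.1` default are illustrative choices, not anything the article prescribes.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, given sequence log-probabilities."""
    # How much more the policy likes each answer than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)), written as softplus(-z) in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy matches the reference the loss is log 2. It falls as the policy shifts probability toward the chosen answer and away from the rejected one, and `beta` controls how hard it is pushed away from the reference.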

RAG, Retrieval, and Memory

RAG became the default because models are finite, stale, and expensive. Retrieval lets you bring fresh or private context at runtime instead of trying to bake everything into weights.

The naive version is easy: chunk documents, embed chunks, store vectors, retrieve top-k, stuff context into prompt. It also fails in boring ways: bad chunking, stale documents, missing metadata, weak reranking, wrong citations, no permission model, and a prompt that says "use the context" as if the model signed a contract.

The useful version is more structured: hybrid retrieval, metadata filters, reranking, source freshness, and permissions.

Frameworks here include LangChain, LangGraph, LlamaIndex, Haystack, DSPy, and vector stores like Qdrant, Milvus, Weaviate, Chroma, FAISS, and pgvector.

The framework matters less than the discipline. Retrieval systems are data systems. Treat them that way or enjoy debugging cosine similarity like it is a moral failing.
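
As one way to picture "retrieval systems are data systems," here is a small sketch that fuses a keyword ranking and a vector ranking with reciprocal rank fusion, then drops documents the user is not allowed to read. The document shape (`allowed_groups`), the function names, and the constant `k=60` are assumptions for the example. In a real system the permission filter usually belongs inside the index query rather than after it.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one score per doc."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores

def hybrid_retrieve(keyword_hits, vector_hits, docs, user_groups, top_n=3):
    """Fuse keyword and vector rankings, then keep only docs the user may read."""
    fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
    allowed = [d for d in fused if docs[d]["allowed_groups"] & user_groups]
    # Highest fused score first; doc id breaks ties so results are deterministic.
    return sorted(allowed, key=lambda d: (-fused[d], d))[:top_n]

docs = {
    "a": {"allowed_groups": {"eng"}},
    "b": {"allowed_groups": {"finance"}},
    "c": {"allowed_groups": {"eng", "finance"}},
}
results = hybrid_retrieve(["b", "a", "c"], ["a", "c", "b"], docs, {"eng"})
```

Here `results` contains only `a` and `c`: document `b` ranked first on keywords, but the user cannot read it, so it never reaches the prompt.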

Agents Are Mostly Systems Engineering

An agent is not a model. It is a loop around a model.

The loop has state, tools, permissions, memory, retries, planning, evals, traces, human approval gates, and failure modes that do not show up in benchmark tables. This is why most demos work and most deployed agents disappoint. The demo only has to succeed once. Production has to survive Tuesday.
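
The loop is easier to argue about once it is written down. Below is a minimal sketch, assuming a `policy` callable that stands in for the model and returns either a tool call or a final answer. `Tool`, `run_agent`, and the trace event names are invented for the example and do not come from any of the frameworks discussed here.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    run: Callable[[str], str]
    needs_approval: bool = False

def run_agent(policy, tools, approve, max_steps=5):
    """Drive `policy` through a bounded tool loop and return the trace.

    `policy(trace)` returns ("call", tool_name, arg) or ("finish", answer).
    `approve(tool_name, arg)` is the human gate for risky tools.
    """
    trace = []
    for _ in range(max_steps):
        decision = policy(trace)
        if decision[0] == "finish":
            trace.append({"event": "finish", "answer": decision[1]})
            return trace
        _, name, arg = decision
        tool = tools.get(name)
        if tool is None:
            trace.append({"event": "error", "detail": f"unknown tool {name}"})
        elif tool.needs_approval and not approve(name, arg):
            trace.append({"event": "denied", "tool": name, "arg": arg})
        else:
            trace.append({"event": "tool", "tool": name, "arg": arg,
                          "result": tool.run(arg)})
    trace.append({"event": "stopped", "detail": "step budget exhausted"})
    return trace
```

Three things in this sketch do the production work: the step budget, the approval gate, and the trace. None of them is the model.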

The modern agent frameworks are converging around the same ideas: persistence, typed tool schemas, human interrupts, sandboxing, evals, and observability.

The serious frameworks now include OpenAI Agents SDK, Google ADK, LangGraph, Microsoft Agent Framework, Claude Agent SDK, Pydantic AI, CrewAI, Mastra, Strands Agents, LlamaIndex Workflows, DSPy, and smaller framework-specific runtimes around vertical products.

But the correct move is often not "add more agents." The correct move is "make the workflow explicit." Multi-agent systems are useful when the boundaries are real. Otherwise you have invented Slack for models.

The Agent Stack Is Becoming Its Own Stack

By mid-2026, agents are no longer just a prompt pattern. They are becoming an infrastructure category.

OpenAI is pushing a model-native agent harness with controlled workspaces, files, tools, and sandbox execution. Google ADK frames agent development as code-first software engineering with tools, evaluation, deployment, and A2A integration. LangGraph is betting on durable state, human interrupts, persistence, and long-running workflows. Microsoft is merging the AutoGen and Semantic Kernel lineages into Microsoft Agent Framework. Anthropic is exposing the Claude Code loop through the Claude Agent SDK. AWS Strands, Pydantic AI, CrewAI, Mastra, and others are all converging on the same brutal truth: an agent that cannot be traced, paused, resumed, permissioned, or evaluated is not production software.

Fig. 04 · Production agent runtime: an agent is a runtime, not a prompt

  • Model: reasoning, planning, language
  • Tools: APIs, files, browser, code
  • State: memory, checkpoints, tasks
  • Protocols: MCP, A2A, tool schemas
  • Control plane: identity, scopes, evals, traces, approvals

The agent runtime is the model plus everything that makes action survivable: tools, state, protocols, identity, permission scopes, evals, and traces.

Where Agents Go by the End of 2026

The end of 2026 will not be "fully autonomous AI employees." That phrasing belongs in vendor decks and other minor crime scenes.

The realistic end-of-2026 agent shape is narrower and more useful: task agents with tools, memory, and scoped action inside known workflows.

This is the part builders should pay attention to. Capability is improving fast, but deployment is being gated by governance, identity, cost, and reliability. Not glamour. Plumbing. The usual villain.

Where 2027 Stands

This section is a forecast, not prophecy. The evidence points in a clear direction, but anyone giving you a precise 2027 agent timeline is either selling software or avoiding a harder question.

By 2027, agents split into two worlds.

The first world is boring and valuable. Vertical, permissioned, auditable agents that operate inside known workflows. Coding agents that can work through issue queues. Research agents that gather evidence and cite sources. Support agents that triage, retrieve, draft, and escalate. Finance agents that reconcile records but need approval before moving money. These survive because the workflow is bounded and the ROI is legible.

The second world gets humbled. Broad autonomous agents with vague goals, broad access, weak evals, and no governance will get demoted or shut down. Gartner is already projecting that governance gaps will cause many enterprises to demote or decommission autonomous agents by 2027. That is not anti-agent. That is normal software growing up and discovering compliance departments.

Fig. 05 · Agent maturity ladder: the agent road to 2027

  • Assistant: answers, drafts, summarizes (2025 baseline)
  • Task agent: tools, memory, scoped action (end 2026)
  • Collaborative agents: A2A, handoffs, artifacts (2027 mainstream)
  • Governed agent ecosystem: identity, policy, audit, evals, rollback

The winning agents are not the most autonomous. They are the best-scoped, best-instrumented, and easiest to stop.

The agent maturity curve is not just more autonomy. It is more bounded autonomy, better tools, stronger identity, and harder operational controls.

The durable 2027 pattern is not one agent doing everything. It is a governed mesh of small agents, tools, memories, and workflows. Each agent has a scope. Each scope has permissions. Each permission has logs. Each high-risk action has a checkpoint. Glamorous? No. Shippable? Yes.

The more autonomy you give the system, the more boring the surrounding software must become.

Inference Is the New Backend War

Once the model is good enough, the next bottleneck is serving it cheaply and reliably.

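The key techniques are now standard vocabulary: continuous batching, KV cache management (PagedAttention, prefix caching), speculative decoding, and model routing.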

Frameworks: vLLM, SGLang, TensorRT-LLM, Hugging Face TGI, llama.cpp, Ollama, and MLX for Apple silicon. For cloud production, inference is now its own engineering discipline. For local AI, llama.cpp and Ollama made "run the model here" normal.

Fig. 06 · Inference bottlenecks: inference is memory, latency, and utilization

Request path:

  • Request: prompt + tools
  • Prefill: context processing
  • Decode: one token at a time
  • Response: stream + trace

Optimizations:

  • Continuous batching: GPU utilization
  • KV cache tricks: PagedAttention, prefix
  • Speculation: draft then verify
  • Routing: cost control

Serving LLMs is mostly about keeping expensive hardware busy while users experience low latency. That is harder than it sounds, as all useful backend work is.
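
The memory side of this is plain arithmetic. Here is a sketch of the usual KV cache size estimate: every cached token stores one key and one value vector per layer per KV head. The model shape in the example (32 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical configuration chosen for round numbers, not a claim about any specific model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for `batch` sequences of `seq_len` tokens."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = one K plus one V
    return per_token * seq_len * batch

GIB = 2 ** 30

# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, fp16 cache.
one_request = kv_cache_bytes(32, 8, 128, seq_len=8192)
sixteen_requests = kv_cache_bytes(32, 8, 128, seq_len=8192, batch=16)
print(one_request / GIB, sixteen_requests / GIB)
```

With that shape, a single 8,192-token request holds 1 GiB of cache and sixteen concurrent ones hold 16 GiB, before counting weights. That is why paging, prefix sharing, and cache quantization get so much attention.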

Evals, Observability, and the Quality Loop

The most important framework in an AI system is often not the model framework. It is the evaluation harness.

Without evals, every prompt change is a production experiment. Every model swap is superstition. Every agent improvement is a story told by the person who made it.

Modern AI evaluation has layers:

Frameworks here include MLflow, Weights & Biases, LangSmith, Langfuse, Arize Phoenix, Evidently, promptfoo, and custom eval suites. Custom is not a sin. Unmeasured is.
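
A custom eval suite does not have to be large to be useful. The sketch below assumes each case carries an `id`, an `input`, and a `check` function, and that the system under test is any callable from string to string. Those shapes, and the names `run_evals` and `regression_gate`, are assumptions for the example rather than the format of any tool listed above.

```python
def run_evals(system, cases):
    """Run `system` on each case and return per-case pass/fail results."""
    results = []
    for case in cases:
        output = system(case["input"])
        results.append({"id": case["id"],
                        "passed": bool(case["check"](output)),
                        "output": output})
    return results

def regression_gate(baseline, candidate, max_new_failures=0):
    """Block a change if it breaks cases that the baseline passed."""
    base_pass = {r["id"] for r in baseline if r["passed"]}
    newly_broken = sorted(r["id"] for r in candidate
                          if not r["passed"] and r["id"] in base_pass)
    return {"ok": len(newly_broken) <= max_new_failures,
            "newly_broken": newly_broken}
```

Run the same cases against the current system and the proposed change, then refuse to ship when `ok` is false. The gate compares against the baseline rather than a fixed pass rate, so it catches regressions even when the overall score goes up.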

Security and Governance Are Architecture Now

Classic ML governance cared about data, model behavior, fairness, drift, and auditability. Agent governance has to care about all of that plus action.

An agent can read private data, browse hostile content, call tools, write files, send messages, update records, and trigger other agents. That combination changes the threat model. The OWASP Top 10 for Agentic Applications exists because autonomous systems fail differently from chatbots. The 2025 AI Agent Index found that the agent ecosystem is powerful, fast-moving, and inconsistently documented, especially around safety, evaluations, and social impact.

The practical governance model is simple enough to fit on a whiteboard:

Agent risk = data access × tool power × autonomy × blast radius.

Fig. 07 · Agent risk control loop: governance has to wrap the loop

  • Identity: who is acting?
  • Permissions: what can it do?
  • Sandbox: where can it fail?
  • Audit: what happened?
  • Evals: did it work?
  • Rollback: can we undo?

Agent governance is not just policy text. It has to become runtime behavior: identity, permission scopes, sandboxing, audit logs, evals, and rollback.
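
The whiteboard formula (agent risk as the product of data access, tool power, autonomy, and blast radius) can be turned into a rough triage score. This sketch assumes each factor is rated on a 1 to 5 ordinal scale, and the thresholds and control lists are purely illustrative; neither comes from a standard, and the numbers only make sense relative to each other.

```python
def agent_risk(data_access, tool_power, autonomy, blast_radius):
    """Multiply four 1-5 ordinal ratings; the result ranges from 1 to 625."""
    factors = (data_access, tool_power, autonomy, blast_radius)
    if any(not 1 <= f <= 5 for f in factors):
        raise ValueError("each factor must be between 1 and 5")
    score = 1
    for f in factors:
        score *= f
    return score

def required_controls(score):
    """Map a score to controls. Thresholds are illustrative, not a standard."""
    if score <= 16:
        return ["audit log"]
    if score <= 100:
        return ["audit log", "scoped tokens", "sandbox"]
    return ["audit log", "scoped tokens", "sandbox", "human approval", "rollback plan"]

faq_bot = agent_risk(data_access=1, tool_power=1, autonomy=2, blast_radius=1)
payments_agent = agent_risk(data_access=4, tool_power=5, autonomy=3, blast_radius=5)
```

The product form is the useful part. Cutting any single factor, such as making a tool read-only, shrinks the whole score, which is usually cheaper than trying to make the model more careful.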

The security checklist is not exotic. Least privilege. Scoped tokens. Sandboxed code execution. Separate read and write permissions. Human approval for irreversible actions. Tool schemas that validate inputs and outputs. Memory poisoning defenses. Prompt injection isolation. MCP server review. A2A authentication and authorization. Logs that a human can actually read.
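
Several items on that checklist reduce to one small function that runs before every tool call. The sketch below assumes scopes are plain strings such as `records:read` and `records:write`, and that each tool declares its required arguments. `ToolSpec` and `authorize` are hypothetical names, and real schema validation would check types and values, not just presence.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolSpec:
    name: str
    scope: str             # e.g. "records:read" or "records:write"
    irreversible: bool
    required_args: tuple

def authorize(spec, args, granted_scopes, human_approved=False):
    """Return (allowed, reason) for one proposed tool call."""
    missing = [a for a in spec.required_args if a not in args]
    if missing:
        return False, f"missing arguments: {missing}"
    if spec.scope not in granted_scopes:          # least privilege: no implicit scopes
        return False, f"scope {spec.scope} not granted"
    if spec.irreversible and not human_approved:  # checkpoint before the point of no return
        return False, "irreversible action needs human approval"
    return True, "ok"

read_record = ToolSpec("read_record", "records:read", False, ("record_id",))
delete_record = ToolSpec("delete_record", "records:write", True, ("record_id",))
```

An agent holding only `records:read` can call `read_record` and nothing else. Granting `records:write` still leaves `delete_record` blocked until a human approves, and every refusal comes with a reason that can go straight into the audit log.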

The weird part is not the controls. The weird part is that AI builders have to relearn enterprise security because the model can now ask for the dangerous thing in fluent English.

The Actual Decision Tree

If you are building in 2026, the practical question is not "what is the latest technique?" It is "what failure mode am I buying down?"

If your data is structured

Start with scikit-learn, XGBoost, or LightGBM. Add deep learning only when the baseline loses for a concrete reason.

If your task needs private knowledge

Build RAG with hybrid retrieval, metadata, reranking, source freshness, and permissions. The vector DB is not the whole system.

If your task needs action

Use an explicit workflow graph, typed tools, sandboxing, traces, and human checkpoints. Do not hide control flow in a paragraph.

If your model is too expensive

Try routing, caching, quantization, distillation, smaller models, vLLM/SGLang/TensorRT-LLM, and fewer unnecessary tokens.

That last one is underrated. Most AI systems waste tokens like a founder with fresh funding wastes SaaS seats. The cheapest inference optimization is often deleting prompt junk.
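
Routing and caching can be sketched in a few lines. This example assumes a cheap model that returns an answer plus a confidence value, and a strong model used only as a fallback. Where that confidence comes from (a classifier, a heuristic, a verifier) is left open, and the class name, threshold, and exact-match cache are all illustrative choices.

```python
import hashlib

class CachedRouter:
    """Try a cheap model first, escalate on low confidence, cache by prompt hash."""

    def __init__(self, cheap, strong, min_confidence=0.8):
        self.cheap, self.strong = cheap, strong
        self.min_confidence = min_confidence
        self.cache = {}
        self.calls = {"cheap": 0, "strong": 0, "cache": 0}

    def answer(self, prompt):
        # Collapse whitespace so trivially different prompts share a cache entry.
        normalized = " ".join(prompt.split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in self.cache:
            self.calls["cache"] += 1
            return self.cache[key]
        self.calls["cheap"] += 1
        text, confidence = self.cheap(prompt)
        if confidence < self.min_confidence:
            self.calls["strong"] += 1
            text = self.strong(prompt)
        self.cache[key] = text
        return text
```

The `calls` counter is the point of the exercise: it tells you what fraction of traffic actually needed the expensive model, which is the number the budget conversation runs on.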

Where I Would Put My Attention

If you want to stay current without drowning, track these areas.

First, inference systems. Model intelligence will keep improving, but cost and latency decide what can ship. vLLM, SGLang, TensorRT-LLM, TGI, llama.cpp, Ollama, MLX, and model routing are not side quests. They are the bridge from demo to product.

Second, evals and observability. Agents make failure harder to see. Traces, regression tests, and quality metrics are the only way to prevent "the model got smarter" from becoming "the product got weirder."

Third, retrieval and memory. The next useful AI products are not generic chatbots. They are systems that know the user's world, the company's data, the current task, and what happened yesterday.

Fourth, post-training. The gap between base capability and useful behavior is still huge. DPO, GRPO, distillation, LoRA, synthetic data, and eval-driven fine-tuning are the knobs teams can actually turn.

Fifth, world models and multimodal action. This is earlier, but important. If agents are going to act reliably in physical or complex digital environments, they need to predict consequences, not just produce plausible next tokens.

The field is moving fast, but the shape is getting clearer.

Models are becoming components.

Data systems are coming back.

Evals are becoming non-negotiable.

Inference is becoming infrastructure.

And the real work is shifting from "can the model answer?" to "can the whole system behave?"

That is the stack now.

Sources