AI-assisted multi-agent software development has entered a mature, fast-moving phase in early 2026, with Claude Code and Codex CLI as the dominant terminal-first agents, a thriving ecosystem of orchestration tools built on tmux and git worktrees, and an emerging consensus on workflows that treat developers as engineering managers of AI fleets. The most consequential recent events include Peter Steinberger joining OpenAI in February 2026, Anthropic's release of Claude Opus 4.6 with Agent Teams, the AGENTS.md standard moving under Linux Foundation governance, and SWE-bench Verified scores cresting 80% for the first time. Meanwhile, Cursor's cloud agents and event-driven Automations, Kiro's spec-driven development, and Docker's microVM sandboxes signal paradigm shifts that may reshape the entire landscape. The review bottleneck—not code generation—is now the binding constraint, with Faros AI telemetry showing 91% longer code review times despite 98% more PRs merged.
Weekly research log
LatestAgentic.Dev
Weekly research on multi-agent AI development tools & workflows.
Which isolation model will win for multi-agent coding: worktrees, containers, or same-branch atomic commits?
Can review throughput improve fast enough to keep up with agent-generated code volume?
Will MCP remain dominant, or will CLI-first workflows keep taking share?
steinberger
Peter Steinberger & OpenClaw
Peter Steinberger (steipete), the Austrian developer who founded PSPDFKit and bootstrapped it for 13 years before a $100M+ exit in 2021, returned from retirement in April 2025 when he discovered AI's "paradigm shift." His open-source project OpenClaw (originally "Clawdbot," briefly "Moltbot" after Anthropic's legal team flagged trademark similarity to "Claude") became the fastest-growing project in GitHub history, surpassing 180,000+ stars by early February 2026. On February 14, 2026, Steinberger announced he was joining OpenAI, with Greg Brockman publicly welcoming him and Sam Altman calling him "a genius with a lot of amazing ideas about the future of very smart agents." OpenClaw transitioned to an independent foundation structure. His appearance on Lex Fridman podcast #491 (February 11-12, 2026) was a 3+ hour conversation covering OpenClaw's origin, the naming drama, security concerns, and his philosophy of agentic engineering. Steinberger's influence on the agentic coding community stems from three definitive blog posts: "My Current AI Dev Workflow" (August 2025), "Just Talk To It — the no-BS Way of Agentic Engineering" (October 2025), and "Shipping at Inference-Speed" (December 2025). Simon Willison featured the October post prominently.
His workflow philosophy centers on several distinctive positions. He runs 3–8 Codex CLI instances in parallel in a 3×3 terminal grid, most operating in the same folder on main. He explicitly rejects git worktrees: "I experimented with worktrees, PRs but always revert back to this setup as it gets stuff done the fastest." Instead, his agents make atomic commits, each listing only the files it touched. His ~800-line AGENTS.md file, which he calls "organizational scar tissue" — "I didn't write it, codex did, and anytime sth happens I ask it to make a concise note in there" — lives in his shared agent-scripts repository with downstream repos pointing to it.
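The same-folder, same-branch setup stays coherent only because each commit is pathspec-limited to the files one agent touched. A minimal sketch of that discipline (a hypothetical helper, not Steinberger's actual scripts; file names are illustrative):

```python
import subprocess

def atomic_commit_cmd(files: list[str], message: str, repo: str = ".") -> list[str]:
    """Build a pathspec-limited git commit.

    Listing paths after "--" makes git record only those files,
    even if another agent sharing the checkout has staged
    unrelated changes in the meantime.
    """
    return ["git", "-C", repo, "commit", "-m", message, "--", *files]

def atomic_commit(files: list[str], message: str, repo: str = ".") -> None:
    # check=True surfaces a non-zero git exit as an exception.
    subprocess.run(atomic_commit_cmd(files, message, repo), check=True)
```

The design point is that `git commit -- <paths>` commits the current working-tree content of exactly those paths, bypassing whatever else sits in the shared index — which is what makes per-agent atomic commits safe on a shared branch.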
He is strongly anti-MCP despite having written five himself: "Almost all MCPs really should be CLIs... Use GitHub's MCP and see 23K tokens gone." He prefers CLIs because the agent naturally discovers usage via help menus without paying constant context costs. His "blast radius thinking" guides task decomposition: "When I think of a change I have a pretty good feeling about how long it'll take and how many files it will touch. I can throw many small bombs at my codebase or one 'Fat Man' and a few small ones."
Steinberger transitioned from Claude Code to Codex CLI between August and October 2025, citing frustration with Claude's sycophantic tone: "I used to love Claude Code, these days I can't stand it anymore... the absolutely right's, the 100% production ready messages while tests fail." He calls vibe coding "a slur" and insists on "agentic engineering" as the proper term. His open-source tools include Peekaboo (macOS screenshot CLI for AI visual QA, 2.7K stars), mcporter (package MCPs as CLIs, 2.7K stars), Oracle (invoke GPT-5 Pro with custom context, 1.6K stars), CodexBar (menu bar usage stats), tmuxwatch (Charmbracelet TUI for tmux monitoring), and Trimmy (flatten multi-line shell snippets). He spends approximately $1K/month on 4 OpenAI subscriptions plus 1 Anthropic subscription.
claude-code
Claude Code Ecosystem
Claude Code's most significant evolution in early 2026 is the experimental Agent Teams feature, launched alongside Claude Opus 4.6 on February 4-5, 2026. Agent Teams enables a team lead session to spawn teammate agents, each with its own context window, coordinated through a shared file-backed task list with dependency tracking and a mailbox system using JSON-appended inbox files. File-lock-based claiming prevents race conditions. Teammates do not inherit the lead's conversation history—they start fresh with only the spawn prompt—and token costs run approximately 5× per teammate. The feature requires setting CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1. Opus 4.6 itself brought 200K default context (1M in beta), 128K token output (doubled from 64K), adaptive thinking as the new recommended mode, and effort parameter GA with three levels (low, medium, high). It scored 80.8% on SWE-bench Verified and 90.2% on BigLaw Bench. Anthropic claims it "uncovered 500 zero-day vulnerabilities," a claim that drew community scrutiny.
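The coordination primitives described here — file-lock-based claiming plus append-only JSON inboxes — are straightforward to picture. A toy sketch of the pattern (my illustration, not Anthropic's implementation; the file layout and message fields are invented):

```python
import json
import os

def claim_task(task_id: str, agent: str, root: str) -> bool:
    """Claim a task by atomically creating its lock file.
    O_CREAT | O_EXCL guarantees exactly one teammate wins the race."""
    os.makedirs(root, exist_ok=True)
    try:
        fd = os.open(os.path.join(root, f"{task_id}.lock"),
                     os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another agent already claimed it
    os.write(fd, agent.encode())
    os.close(fd)
    return True

def send(inbox: str, message: dict) -> None:
    """Mailbox write: append one JSON object per line (JSONL)."""
    with open(inbox, "a") as f:
        f.write(json.dumps(message) + "\n")

def read_inbox(inbox: str) -> list[dict]:
    """Mailbox read: every line is an independent JSON message."""
    with open(inbox) as f:
        return [json.loads(line) for line in f if line.strip()]
```

Append-only JSONL inboxes have the nice property that readers never see a torn message as long as each write is a single line.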
The native --worktree flag is now built into Claude Code, activated via claude --worktree feature-auth or -w. It creates .claude/worktrees/<name>/ with an auto-generated branch, cleans up automatically if no changes are made, and prompts for keep/remove otherwise. Custom subagents can specify isolation: worktree in their YAML frontmatter. Boris Cherny, Claude Code's creator, called worktrees his "number one productivity tip," running 3-5 simultaneously—a direct contrast with Steinberger's anti-worktree stance.
Headless mode (-p / --print) supports three output formats: text, JSON (with metadata including total_cost_usd and session_id), and stream-json (NDJSON). The --json-schema flag enables structured output via constrained decoding, guaranteeing schema compliance. Session resumption works via --continue (most recent) or --resume <session-id>, with sessions stored at ~/.claude/projects/<encoded-cwd>/<session-id>.jsonl.
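Consuming stream-json output amounts to folding NDJSON lines into events. A minimal sketch (total_cost_usd and session_id are the metadata fields named above; the other keys in the sample are illustrative):

```python
import json

def parse_stream(lines):
    """Fold a stream-json (NDJSON) transcript into (events, summary),
    pulling out cost and session metadata where present."""
    events, summary = [], {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the stream
        event = json.loads(line)
        events.append(event)
        if "total_cost_usd" in event:
            summary["total_cost_usd"] = event["total_cost_usd"]
        if "session_id" in event:
            summary["session_id"] = event["session_id"]
    return events, summary
```

In practice the lines would arrive from a pipe around a headless claude -p invocation with the stream-json output format selected; check the current docs for exact flag spelling.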
The Claude Agent SDK, available in both Python (claude-agent-sdk) and TypeScript (@anthropic-ai/claude-agent-sdk), provides programmatic access to Claude Code's agent loop, built-in tools (Read, Write, Edit, Bash, Glob, Grep, WebSearch, WebFetch), custom tools via in-process MCP servers, and hooks (PreToolUse, PostToolUse, SubagentStop, SessionEnd). Microsoft announced Agent Framework integration.
CLAUDE.md best practices have crystallized around three scopes: global (~/.claude/CLAUDE.md), project (./CLAUDE.md), and local (.claude/CLAUDE.md). The @path/to/file.md import syntax supports recursive imports up to 5 levels deep. HumanLayer's recommendation of under 60 lines for the root file has become community consensus, with research suggesting frontier LLMs can follow ~150-200 instructions with reasonable consistency but quality degrades uniformly as count increases. The WHY/WHAT/HOW hierarchy places universal rules in CLAUDE.md, on-demand playbooks in .claude/skills/, and deep reference in docs/agent-guides/. The .claude/rules/*.md directory provides auto-loaded conditional rules with paths frontmatter.
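The import mechanics are simple enough to model. A toy resolver for the @path syntax with the 5-level depth cap (simplified to whole-line imports; the real parser's edge cases are not reproduced here):

```python
import os
import re

# A line consisting only of "@path/to/file.md" (simplified syntax).
IMPORT = re.compile(r"^@(\S+)$")

def resolve(path: str, depth: int = 1, max_depth: int = 5) -> str:
    """Inline @file imports recursively, relative to the importing
    file, stopping at the depth cap."""
    out, base = [], os.path.dirname(path)
    with open(path) as f:
        for line in f:
            m = IMPORT.match(line.strip())
            if m and depth < max_depth:
                out.append(resolve(os.path.join(base, m.group(1)),
                                   depth + 1, max_depth))
            else:
                # past the cap, leave the reference untouched
                out.append(line.rstrip("\n"))
    return "\n".join(out)
```

The depth cap matters: without it, two files that import each other would recurse forever.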
March 2026 updates include the /loop command (recurring prompts on intervals), cron scheduling within sessions, 20-language voice STT support, an ExitWorktree tool, a Code Review tool (research preview using a team of agents to crawl codebases and rank bugs by severity), and Cowork (desktop preview bringing agentic capabilities to knowledge work beyond coding, running in an isolated VM). Official docs have migrated to code.claude.com/docs/en/ and platform.claude.com/docs/en/.
codex-cli
Codex CLI Ecosystem
Codex CLI reached v0.113.0 on March 10, 2026, now fully rewritten in Rust from its original TypeScript/Node.js/React (Ink) stack. The Rust rewrite, announced in June 2025 and now the default, delivers zero-dependency installation (no Node.js required), native security bindings, optimized performance without garbage collection, and an extensible protocol for multi-language extensions. Current flagship models include GPT-5.4 and GPT-5.3-Codex, with GPT-5.3-Codex-Spark running on Cerebras WSE-3 at 1,000+ tokens/sec (15× faster) available as a research preview. The sandbox architecture is OS-native: Apple Seatbelt via sandbox-exec with runtime-compiled profiles on macOS, Landlock + seccomp with vendored Bubblewrap on Linux, and native restricted tokens on Windows (promoted from experimental in v0.100.0). The .git/ directory is protected as read-only. Network access is binary (all or nothing) on macOS, with a known issue (#10390) where network_access = true in config.toml is silently ignored by Seatbelt. Three sandbox modes operate independently of the approval policy: read-only, workspace-write (default for auto), and danger-full-access. The --full-auto flag combines on-request approval with workspace-write sandbox. The --dangerously-bypass-approvals-and-sandbox flag (alias --yolo) removes all protections.
AGENTS.md became an open standard under the Agentic AI Foundation (AAIF), announced December 9, 2025, stewarded by the Linux Foundation. Co-founders include Anthropic (contributing MCP), Block (contributing Goose), and OpenAI (contributing AGENTS.md). Platinum members span AWS, Google, Microsoft, Bloomberg, and Cloudflare. Over 60,000 open-source projects now use AGENTS.md, supported by Codex, Cursor, GitHub Copilot, Gemini CLI, Windsurf, JetBrains Junie, and many others. Hierarchical discovery walks from Git root to CWD, checking AGENTS.override.md then AGENTS.md at each level, with a configurable 32 KiB combined size limit (project_doc_max_bytes). The override file enables temporary changes (release freezes, incidents) without modifying the base file. OpenAI purchased the agents.md domain.
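The discovery walk is easy to sketch. A toy version under stated assumptions — I treat AGENTS.override.md as shadowing AGENTS.md at its level and apply the size budget greedily; the spec's exact merge and truncation behavior may differ:

```python
import os

def discover_agents_docs(git_root: str, cwd: str,
                         budget: int = 32 * 1024) -> list[str]:
    """Walk from the Git root down to cwd, collecting one agents
    file per level, within a combined size budget
    (project_doc_max_bytes defaults to 32 KiB)."""
    git_root, cwd = os.path.abspath(git_root), os.path.abspath(cwd)
    rel = os.path.relpath(cwd, git_root)
    parts = [] if rel == "." else rel.split(os.sep)
    docs, used, level = [], 0, git_root
    for part in [None] + parts:
        if part is not None:
            level = os.path.join(level, part)
        for name in ("AGENTS.override.md", "AGENTS.md"):
            p = os.path.join(level, name)
            if os.path.isfile(p):
                size = os.path.getsize(p)
                if used + size <= budget:
                    docs.append(p)
                    used += size
                break  # assumed: the override shadows the base file here
    return docs
```

Root-level rules load first, so deeper directories refine rather than precede them; a cwd outside the repo is not handled in this sketch.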
The config.toml profile system supports named profiles activated via codex --profile deep-review, with config resolution following CLI flags → profile → project config → user config → defaults. The /review slash command opens diff-based review presets with an optional review_model override. Other confirmed commands include /status, /model, /plan, /permissions, /init, /resume, /fork, /compact, and /skills. Custom slash commands via ~/.codex/prompts/ are now deprecated in favor of the Skills system. The codex exec command (alias codex e) provides non-interactive mode for CI/CD, with --json for JSONL event streaming, --output-schema for structured validation, and --ephemeral to skip session persistence. A dedicated GitHub Action (openai/codex-action@v1) and TypeScript SDK (@openai/codex-sdk) support programmatic integration.
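That resolution order is an ordinary layered merge, lowest precedence applied first. A sketch (the option keys in the example are illustrative, not Codex's actual names):

```python
def resolve_config(*, cli=None, profile=None, project=None,
                   user=None, defaults=None) -> dict:
    """Later layers win: defaults < user < project < profile < CLI flags."""
    merged = {}
    for layer in (defaults, user, project, profile, cli):
        merged.update(layer or {})
    return merged
```

Expressing precedence as merge order keeps the rule auditable: whatever dict is applied last for a key is the value that sticks.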
Recent additions include the Codex App (macOS desktop, Windows added March 4, 2026) with multi-agent management and auto-worktrees, Codex Cloud (codex cloud exec for async tasks with best-of-N runs), a curated plugin marketplace (v0.113.0), streaming stdin/stdout/stderr for command execution, and a request_permissions tool for runtime permission escalation. Fast mode is now enabled by default with TUI indicators.
orchestration
Orchestration & Infrastructure Tools
The orchestration layer for multi-agent coding has converged on tmux + git worktrees as foundational infrastructure, with several tools offering distinct approaches. Claude Squad (github.com/smtg-ai/claude-squad), written in Go, is the flagship orchestrator—it pairs tmux sessions with git worktrees so each agent operates on its own branch with zero runtime interference. It supports Claude Code, Codex, Gemini CLI, Aider, and OpenCode via the -p flag, with auto-accept mode (-y) for background completion. workmux (github.com/raine/workmux), written in Rust by Raine Virta, takes an opinionated "one worktree = one tmux window" approach with .workmux.yaml config files, agent status icons in window names, auto-detection of built-in agents, and an LLM-based auto branch name generator. It also supports Kitty, WezTerm, and Zellij. NTM (Named Tmux Manager, github.com/Dicklesworthstone/ntm) transforms tmux into a full command center, spawning named panes for each agent type (ntm spawn myproject --cc=3 --cod=2 --gmi=1) with broadcast prompts by type, a visual TUI dashboard, automated context rotation, and YAML pipeline definitions. CCManager (github.com/kbwo/ccmanager) takes a self-contained approach requiring no tmux dependency—it manages its own PTY sessions with preset switching between agents, AI-powered auto-approval via Haiku, and devcontainer integration for sandboxed development.
Three separate projects share the "Amux" name: mixpeek's Amux provides a heavy-duty web dashboard with agent-to-agent orchestration via REST API and shared global memory; andyrewlee's Amux offers a clean TUI with headless CLI mode and a job queue system; and hewigovens's Amux is a minimal CLI wrapper. Notable newer tools include Agentboard (browser-based tmux GUI optimized for agent TUIs with iOS Safari support), tmux-agent-indicator (visual pane border feedback for agent states), Clash (Rust CLI detecting merge conflicts across worktrees using git merge-tree), and Context Manager (macOS menubar app for monitoring Claude Code sessions with git branch drift detection).
In the Neovim ecosystem, codecompanion.nvim (~5,600 stars) is the most comprehensive plugin, supporting both HTTP adapters (Anthropic, OpenAI, Gemini, DeepSeek, Ollama, and many others) and Agent Client Protocol adapters (Claude Code, Codex, Gemini CLI, OpenCode, Kiro). ThePrimeagen's "99" (~4,200 stars) takes a deliberately constrained approach "for people without skill issues," offering visual mode code replacement, fill-in-function mode, and TreeSitter context awareness. agentic.nvim implements ACP directly in Neovim with zero-config authentication and session persistence. lazygit (73.9K stars) paired with git-delta provides syntax-highlighted diffs with clickable line numbers that open editors at exact locations—essential for reviewing agent-generated changes.
Kaushik Gopal's agent forking pattern (kau.sh/blog/agent-forking/) represents the minimalist philosophy: a Bash script using tmux to fork subagents from a main session, auto-summarizing long transcripts before feeding to the fork, and supporting cross-agent forking (Codex for planning → Claude Code for coding → Gemini for diagrams). VoiceMode (github.com/mbailey/voicemode) is an open-source MCP server providing natural two-way voice conversations with Claude Code using local Whisper + Kokoro TTS. The commercial alternative Wispr Flow (wisprflow.ai) claims 4× faster input than typing at 184 WPM, with context-aware formatting and SOC 2/HIPAA compliance.
alternatives
Alternative Approaches & New Entrants
The Claude Code / Codex CLI duopoly faces pressure from multiple directions. Gemini CLI (github.com/google-gemini/gemini-cli) reached 1M+ developers with Gemini 3 Pro's 1M token context window and a generous free tier (60 requests/min, 1,000/day). Gemini 3 Flash achieved 78% on SWE-bench, and Gemini 3.1 Pro leads Terminal-Bench 2.0 at 78.4%, overtaking Codex. The tool supports GEMINI.md for project customization, MCP integration, and Google Search grounding as a unique differentiator. Cursor ($29.3B valuation, ~$2B annual revenue) introduced Automations on March 5, 2026—always-on agents triggered by Slack, Linear, GitHub, PagerDuty, webhooks, or schedules, with each agent spinning up a cloud sandbox and learning from past runs via a memory tool. Cloud Agents (February 2026) give each agent an isolated VM, running 10-20 in parallel and producing merge-ready PRs with videos and screenshots. Cursor reports 35% of its own PRs are now generated by cloud agents. The acquisition of code review startup Graphite signals where the constraint truly lives.
OpenCode (opencode.ai, 95,000+ GitHub stars) has emerged as the standout open-source alternative with a polished Bubble Tea TUI, 75+ LLM providers via AI SDK, LSP integration, and GitHub/GitLab integrations where mentioning /opencode in issues triggers automated work. The Cline → Roo Code → Kilo Code fork chain represents rapid open-source evolution: Cline (5M+ installs) added native subagents in v3.58; Roo Code forked with custom modes (Architect/Coder/Debugger) and SOC 2 compliance; Kilo Code forked from both with an Orchestrator mode that routes complex tasks to specialized sub-agents, secured an $8M seed round, and reached 1.5M+ users.
Kiro (kiro.dev), AWS's spec-driven IDE, generates structured specs from prompts before any code is written, uses Agent Hooks (event-driven automations on file save/create/delete), and employs property-based testing with "shrinking" for quality validation. Its Autonomous Agent (preview) works independently across multiple repos with persistent context. Devin (devin.ai) by Cognition occupies the fully autonomous extreme at $500/month for 250 ACUs, with Nubank reporting 12× efficiency improvements on large migrations.
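Property-based testing with shrinking is worth unpacking, since it is the quality mechanism Kiro leans on: generate random inputs, find one that violates a stated property, then greedily shrink it to a minimal failing case. A self-contained miniature (not Kiro's engine; the buggy dedupe is a made-up example):

```python
import random

def shrink(prop, xs):
    """Greedy shrinking: drop elements while the property still fails."""
    changed = True
    while changed:
        changed = False
        for i in range(len(xs)):
            cand = xs[:i] + xs[i + 1:]
            if not prop(cand):
                xs, changed = cand, True
                break
    return xs

def check(prop, gen, tries=200, seed=0):
    """Search random inputs for a counterexample; shrink it if found."""
    rng = random.Random(seed)
    for _ in range(tries):
        xs = gen(rng)
        if not prop(xs):
            return shrink(prop, xs)
    return None

# A buggy dedupe: it also sorts, silently breaking order preservation.
def dedupe(xs):
    return sorted(set(xs))

def preserves_order(xs):
    # Property: dedupe keeps elements in first-occurrence order.
    return dedupe(xs) == list(dict.fromkeys(xs))
```

Shrinking is the reviewer-facing payoff: it turns an eight-element random failure into a two-element report you can read at a glance.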
Docker Sandboxes now use microVM-based isolation (not just containers), supporting Claude Code, Gemini CLI, Codex CLI, and Kiro natively via docker sandbox run <agent>. Container Use by Dagger (github.com/dagger/container-use) provides an open-source MCP server giving each agent its own container plus git worktree. E2B uses Firecracker microVMs with <200ms sandbox launch. The isolation hierarchy runs from microVMs (strongest, ~125ms boot) through gVisor (10-30% I/O overhead) to hardened containers (trusted code only).
Capy (capy.ai), a YC-backed IDE, is architecturally interesting as the only tool designed from scratch for parallel execution: a Captain agent plans while Build agents implement, each in a dedicated cloud VM with git worktrees, supporting up to 25 agents in parallel. mini-swe-agent from Princeton/Stanford proves that ~100 lines of Python can score >74% on SWE-bench Verified, establishing an important baseline for evaluating scaffolding overhead.
best-practices
Community Patterns & Best Practices
Spec-first development has become the dominant paradigm. GitHub released an open-source Spec Kit toolkit, Kiro is built entirely around specs, and Addy Osmani's O'Reilly guide recommends specs covering commands, testing, project structure, code style, git workflow, and boundaries with three-tier classifications (Always/Ask first/Never). Martin Fowler's team (Birgitta Böckeler) offers the important caveat that "the term 'spec-driven development' isn't very well defined yet." Plan Mode in Claude Code (Shift+Tab twice or /plan) restricts the agent to read-only operations for analyzing codebases and creating plans before execution. Boris Cherny uses it at the start of most sessions: "start in Plan Mode, go back and forth until the plan is right, then switch to Auto-Accept and let Claude execute." The recommended four-phase workflow is Explore → Plan → Implement → Commit. The opusplan alias routes Opus for planning and Sonnet for execution as a cost optimization.
Context management is the most critical operational skill. Claude Code's 200K token window degrades around 147K-152K tokens, with auto-compaction triggering at approximately 83.5% capacity. System prompts, tool definitions, and MCP schemas consume 30-40K tokens before the user types anything. The /compact command summarizes conversation history (lossy after 3-4 applications), while the "Document & Clear" pattern—dumping progress into a .md file, clearing context, starting fresh—provides a manual checkpoint. Community practice favors 30-45 minute focused sessions over marathon 2+ hour sessions.
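Those figures imply a concrete working budget. A back-of-the-envelope sketch — the 200K window, ~83.5% compaction trigger, and 30-40K overhead numbers are from the report above; the subtraction is mine:

```python
# Back-of-the-envelope context budget for a 200K-token window, using the
# figures quoted above: auto-compaction near 83.5% capacity and 30-40K
# tokens of fixed overhead (system prompt, tool definitions, MCP schemas).
WINDOW = 200_000
COMPACTION_TRIGGER = 0.835
OVERHEAD_LOW, OVERHEAD_HIGH = 30_000, 40_000

trigger_tokens = int(WINDOW * COMPACTION_TRIGGER)   # compaction kicks in here
usable_low = trigger_tokens - OVERHEAD_HIGH         # pessimistic budget
usable_high = trigger_tokens - OVERHEAD_LOW         # optimistic budget

print(f"compaction triggers near {trigger_tokens:,} tokens")
print(f"effective working budget: {usable_low:,}-{usable_high:,} tokens")
```

The effective budget of roughly 127K-137K tokens sits below the 147K-152K degradation zone, which is one reason short, focused sessions outperform marathons.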
Model routing across tiers delivers the single biggest cost optimization. Anthropic's own Explore agent runs on Haiku by default—a signal that should not be ignored. The consensus routes Haiku ($1/$5 per MTok) to code review, documentation, linting, and subagent tasks; Sonnet 4.6 ($3/$15) to daily development covering 80-90% of work, scoring 79.6% on SWE-bench Verified; and Opus 4.6 ($5/$25) to architecture decisions, complex refactoring, and orchestration. Infralovers documented a 57% cost reduction from implementing model routing in a 9-agent architecture. Average developer cost runs approximately $6/day, with 90% under $12/day.
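The tier consensus reduces to a routing table plus per-MTok arithmetic. A sketch using the input/output prices quoted above — the task categories and token counts are illustrative, not from any vendor SDK:

```python
# Route task categories to model tiers and estimate per-task cost, using
# the (input $/MTok, output $/MTok) prices quoted above. Token counts in
# the example call are made up for illustration.
PRICES = {
    "haiku": (1.00, 5.00),
    "sonnet-4.6": (3.00, 15.00),
    "opus-4.6": (5.00, 25.00),
}
ROUTES = {
    "code_review": "haiku",
    "docs": "haiku",
    "daily_dev": "sonnet-4.6",
    "architecture": "opus-4.6",
}

def task_cost(task: str, in_tokens: int, out_tokens: int) -> float:
    """Dollar cost of a task routed to its consensus tier."""
    in_price, out_price = PRICES[ROUTES[task]]
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000

# e.g. a review pass: 80K tokens in, 5K out, routed to Haiku
print(round(task_cost("code_review", 80_000, 5_000), 3))  # 0.105
```

Routing the same hypothetical review pass to Opus would cost roughly five times more, which is the whole argument for tiering.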
The engineering manager mindset has become the defining mental model. Addy Osmani's influential January 2026 post states: "AI coding at scale stops being a prompting problem and becomes a management problem." MIT's Missing Semester course now includes agentic coding, framing it as "one helpful mental model might be that of a manager of an intern." The two-agent verification pattern (Agent A implements, Agent B reviews) provides separation of concerns. WIP limits on active agent streams prevent review drowning. Simon Willison's observation that "the natural bottleneck isn't generating code—it's reviewing it" is borne out by Faros AI data: across 10,000+ developers, PR size grew 154% and review time increased 91%, while organizational DORA delivery metrics remained flat.
Benchmarks & Comparisons
SWE-bench Verified scores crossed 80% for the first time in early 2026. The current leaderboard is topped by Claude Opus 4.5 (Thinking) at 80.9%, followed by Opus 4.6 (Thinking) at 80.8%, Gemini 3.1 Pro at 80.6%, MiniMax M2.5 (the leading open-weight model) at 80.2%, and GPT-5.2 at 80.0%. Sonnet 4.6 at 79.6% delivers near-flagship performance at roughly half the Opus cost. On SWE-bench Pro (multi-language, harder), GPT-5.3-Codex leads at 56.8%. On Terminal-Bench 2.0 (March 2026), Gemini 3.1 Pro overtook Codex at 78.4% versus 77.3%. Each vendor strategically highlights benchmarks where it leads—OpenAI avoids self-reporting SWE-bench Verified (Python-only), Anthropic avoids SWE-bench Pro (multi-language).
Token efficiency remains Codex CLI's strongest advantage. Head-to-head testing shows Codex uses 3-4× fewer tokens per task: on a scheduler task, Claude Code consumed 234,772 tokens versus Codex's 72,579 (3.2×); on a Figma clone, Claude Code used 6.2M tokens versus Codex's 1.5M (4.2×). Combined with lower per-token pricing (GPT-5 Codex at $1.25/$10 per MTok versus Sonnet at $3/$15), the cost-per-task gap is substantial.
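Combining the token counts with the per-MTok prices gives a rough cost-per-task gap. A sketch — the 70/30 input/output split is an assumption of mine, since the head-to-head tests report only total tokens:

```python
# Rough cost-per-task comparison using the scheduler-task token counts
# quoted above. The 70/30 input/output split is an assumption; the
# head-to-head tests report only totals.
def blended_cost(total_tokens: int, in_price: float, out_price: float,
                 in_frac: float = 0.7) -> float:
    """Dollar cost of a task, splitting tokens in_frac input / rest output."""
    inp = total_tokens * in_frac
    out = total_tokens * (1 - in_frac)
    return (inp * in_price + out * out_price) / 1_000_000

claude = blended_cost(234_772, 3.00, 15.00)  # Sonnet pricing: $3/$15 per MTok
codex = blended_cost(72_579, 1.25, 10.00)    # GPT-5 Codex: $1.25/$10 per MTok

print(f"Claude Code: ${claude:.2f}  Codex: ${codex:.2f}  "
      f"ratio: {claude / codex:.1f}x")
```

Under this assumed split, the 3.2× token gap compounds with pricing into a cost gap of roughly 5×, which is why the report calls the gap substantial rather than marginal.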
The task routing consensus has solidified: Codex CLI excels at prototyping, quick fixes, terminal/DevOps tasks, and multi-language work; Claude Code leads for complex refactoring, architecture decisions, multi-file reasoning, and computer use (72.7% on relevant benchmarks). Gemini CLI's 1M context window makes it strongest for large-codebase analysis and documentation tasks. The earlier claim of Claude Code at 72.7% on SWE-bench Verified specifically referred to Claude Sonnet 4's initial score—both platforms have since improved dramatically.
A striking data point: Claude Code now authors approximately 4% of all public GitHub commits (~135,000/day), with SemiAnalysis projecting 20%+ by end of 2026. Third-party scaffolds like Verdent demonstrate that agent scaffolding matters enormously—their framework outperforms both Claude Code and Codex using the same underlying models, suggesting the scaffolding race may matter more than the model race.