Gamers Forem: Om Shree

Cursor Just Released Composer 2.5. Here's What Actually Changed for AI Coding Agents.

Om Shree — Thu, 21 May 2026 15:00:59 +0000

Cursor has spent the last year moving from “AI coding assistant” into something much more ambitious: a vertically integrated agentic software engineering stack. Yesterday’s release of Composer 2.5 makes that direction impossible to ignore.

This is not just a faster autocomplete model. Cursor is explicitly optimizing for long-horizon coding agents that can plan, execute, recover from failures, and stay coherent across large multi-step engineering tasks.

The Problem It's Solving

Most coding models still break the moment a task stops being local.

They can generate a React component, patch a bug, or refactor a function. But once the task becomes multi-file, infrastructure-heavy, or operationally ambiguous, the cracks show quickly. Context drifts. Tool calls fail. The model loops. Terminal sessions become chaotic. Long-running execution loses coherence.

That is the real bottleneck in agentic software engineering right now.

Cursor says Composer 2.5 was specifically trained to improve “long-horizon agentic tasks” and follow complex instructions more reliably. The company also claims substantial behavioral improvements around effort calibration, communication style, and execution consistency. (Cursor)

This matters because the next phase of AI coding is no longer about code generation quality alone. It is about whether agents can operate inside real engineering environments without constantly collapsing under state management and execution complexity.

How Composer 2.5 Actually Works

Under the hood, Composer 2.5 continues Cursor’s strategy from Composer 2: domain-specialized reinforcement learning for software engineering workflows.

Cursor’s technical report for Composer 2 describes a two-stage training pipeline:

Continued pretraining on a base model
Large-scale reinforcement learning inside real software engineering environments and agent harnesses (arXiv)

The important detail is not the benchmark number. It is the training environment.

Cursor is training models directly inside the same operational harness used by deployed coding agents — including terminals, tools, multi-step execution chains, and realistic repository interactions. That creates a feedback loop where the model is optimized for actual agent workflows instead of isolated benchmark prompts. (arXiv)

Composer 2.5 reportedly improves:

Long-running task reliability
Multi-step execution planning
Instruction adherence
Agent communication behavior
Effort calibration during coding workflows (The Indian Express)

There is another important layer here: infrastructure economics.

Composer 2 originally gained attention because Cursor delivered strong coding performance at dramatically lower token costs than frontier proprietary models. Cursor positioned it as a cheaper alternative to systems from Anthropic and OpenAI. (Cursor)

That pricing advantage came with controversy.

After launch, developers discovered Composer 2 was built on top of Moonshot AI's open-weight Kimi K2.5 model. Cursor later acknowledged this publicly and admitted it should have disclosed the base model earlier. (Business Insider)

Composer 2.5 reportedly still builds on the same Kimi base checkpoint, but Cursor is increasingly differentiating through RL infrastructure, agent training environments, and deployment tooling rather than raw foundational pretraining. (The Indian Express)

That is a very different strategy from the “train everything from scratch” approach most frontier labs market publicly.

What Developers Are Actually Using It For

The interesting part about Cursor’s recent releases is that they increasingly resemble operational AI infrastructure rather than a standalone IDE.

Over the last few months, Cursor has launched:

Cursor SDK for programmatic agents
Cloud development environments for agents
Bugbot autonomous debugging systems
Multi-agent execution workflows
Cursor 3, a broader agentic workspace layer (Cursor)

Composer 2.5 sits in the middle of that stack.

The target use case is no longer “help me write code faster.” It is:

Autonomous repository maintenance
Long-running refactors
Infrastructure migration workflows
Multi-step debugging
Agent-managed terminal execution
PR generation and validation
Extended software tasks that may run for hours

That direction aligns closely with where the broader MCP and agentic ecosystem is heading.

The future competitive advantage is not just model intelligence. It is orchestration quality: tool reliability, memory handling, execution recovery, context persistence, and operational safety across long-running workflows.

This is exactly why infrastructure companies like Gentoro and MCP ecosystem players like Glama.ai matter increasingly in the stack. Models are becoming interchangeable faster than orchestration layers are.

Why This Is a Bigger Deal Than It Looks

Cursor is quietly proving something the broader AI market still underestimates:

Specialized agent training may matter more than raw frontier scale for real-world developer workflows.

Composer 2.5 is not trying to be a universal reasoning model. It is being optimized aggressively for software execution environments.

That shift has major implications.

The AI coding market is rapidly splitting into two layers:

Foundation model providers
Agent orchestration and execution platforms

Cursor appears to be betting the second layer becomes more defensible over time.

That also explains why the company is investing heavily in infrastructure. Reports indicate Cursor plans to train Composer 2.5 using xAI compute infrastructure with tens of thousands of GPUs. (Business Insider)

The strategic signal here is important:
AI coding is moving from “chatbot in an editor” toward persistent software agents operating inside full execution environments.

And once that happens, infrastructure quality becomes the actual moat.

Availability and Access

Composer 2.5 is now available through Cursor.

The release follows Cursor’s broader push into autonomous coding systems and arrives during intensifying competition from Claude Code, OpenAI, and other agentic developer tooling platforms. (WIRED)

The bigger story is not whether Composer 2.5 wins a benchmark cycle. It is that Cursor is steadily building an operational stack for autonomous software engineering.

The IDE war is turning into an agent infrastructure war.

Follow for more coverage on MCP, agentic AI, and AI infrastructure.

Microsoft Just Framed MCP as Part of the Open Agentic Stack. Here's What That Actually Means.

Om Shree — Thu, 21 May 2026 14:57:44 +0000

For years, Microsoft’s open source strategy was mostly about cloud adoption and developer ecosystems. At Open Source Summit North America 2026, the company made something much bigger clear: it now sees open protocols and agent infrastructure as the next foundational layer of computing.

And buried inside that announcement was the real signal for the MCP ecosystem.

The Problem It's Solving

Right now, most AI agents are still trapped inside fragmented execution environments.

Every framework has its own tooling model. Every cloud vendor has its own orchestration stack. Tool access, memory handling, governance, and runtime execution are all implemented differently depending on the platform. That fragmentation becomes a serious problem once agents move from demos into production infrastructure.

Microsoft’s latest messaging is essentially acknowledging that agentic systems need the equivalent of what Kubernetes became for containers: portable infrastructure primitives and open interoperability standards. ([Microsoft Open Source][1])

That is where MCP starts becoming strategically important.

Model Context Protocol was initially framed as a standardized interface for connecting models to tools and external systems. But the ecosystem around it has evolved rapidly. MCP is increasingly becoming a shared interoperability layer for agent execution environments, tool routing, UI delivery, and cross-platform orchestration. ([Wikipedia][2])

Microsoft’s Open Source Summit announcement strongly suggests the company understands that shift.

How Microsoft's Open Agentic Stack Actually Works

In its official summit post, Microsoft explicitly described its vision around “frameworks, protocols, and governance for AI agents” and repeatedly emphasized the need for agents to operate “across frameworks, clouds, languages, and runtimes.” ([Microsoft Open Source][1])

That wording matters.

The company is no longer talking about isolated copilots. It is talking about infrastructure portability.

Microsoft’s announcement centered around several major layers:

Azure Linux 4.0
Azure Container Linux
Open governance tooling
Secure software supply chain infrastructure
Open agentic system interoperability ([SDxCentral][3])

The Linux layer is especially important here.

Microsoft says Azure Linux 4.0 is being positioned as a hardened operating system foundation for cloud-native and AI-native workloads. The company also confirmed that Linux infrastructure now underpins large parts of Azure’s AI stack, including services tied to GitHub, Microsoft 365, and ChatGPT-scale deployments. ([Cloud Native Now][4])

That changes the MCP conversation significantly.

MCP servers do not exist in isolation. Real-world deployment requires:

Sandboxed execution
Tool governance
Authentication layers
Runtime isolation
Observability
Secure software supply chains
Container orchestration
Cross-agent communication

In other words, MCP adoption eventually becomes an infrastructure problem, not just a protocol problem.

Microsoft’s summit positioning suggests the company increasingly sees agent interoperability and runtime portability as core platform primitives — similar to how Kubernetes standardized container orchestration a decade ago.

What Developers Are Actually Building With MCP

The MCP ecosystem has quietly moved far beyond simple tool calling.

Developers are now using MCP to build:

Multi-agent orchestration systems
Secure enterprise tool gateways
Agent memory layers
Remote execution environments
Interactive AI application interfaces
Cross-model tool portability
Agent observability pipelines

That evolution is happening fast across the open ecosystem.

The Agentic AI Foundation recently positioned MCP as a key interoperability layer for “secure, scalable agentic AI systems” operating across tools, models, and platforms. ([Linux Foundation][5])

At the same time, infrastructure companies are racing to operationalize the stack around it.

Platforms like Glama.ai are increasingly focused on MCP gateway quality, discoverability, and secure tool integration. Companies like Gentoro are working on orchestration and infrastructure layers for enterprise-grade agent systems.

This is the important shift:
the protocol itself is becoming less valuable than the operational ecosystem forming around it.

And Microsoft appears to be positioning Azure directly underneath that future stack.

Why This Is a Bigger Deal Than It Looks

The most important part of Microsoft’s announcement was not Linux.

It was the company openly framing agentic AI as an open systems problem rather than a proprietary model problem.

That is a major strategic distinction.

The AI industry spent the last two years competing almost entirely on model intelligence. But production agent systems introduce a completely different bottleneck:

interoperability
execution reliability
governance
runtime security
infrastructure portability
software supply chain trust

Those are open infrastructure problems.

That is also why Open Source Summit North America 2026 heavily centered discussions around AI infrastructure, supply chain security, embedded systems, and agentic AI on the same stage. ([Cloud Native Now][6])

The ecosystem is converging around a new reality:
agents are becoming distributed systems.

And distributed systems historically standardize around open protocols faster than proprietary interfaces.

That creates a very favorable environment for MCP.

Availability and Access

Microsoft’s announcements around Azure Linux 4.0 and Azure Container Linux were unveiled during Open Source Summit North America 2026, with broader rollout activity expected around Microsoft Build. ([SDxCentral][3])

The more important takeaway is strategic:
Microsoft is increasingly treating agent infrastructure as a first-class cloud layer.

And once cloud vendors start organizing around open agent interoperability, MCP stops looking like a niche protocol and starts looking like foundational infrastructure.

Follow for more coverage on MCP, agentic AI, and AI infrastructure.

Google Just Shipped Gemini 3.5 Flash. Here's What Developers Actually Need to Know.

Om Shree — Thu, 21 May 2026 14:26:01 +0000

The Flash series has always been Google's answer to the speed-vs-intelligence tradeoff. With Gemini 3.5 Flash, Google is making a different argument: you shouldn't have to choose.

The Problem It's Solving

The history of "fast" AI models is a history of compromise. You got low latency, but you gave up reasoning depth. You got cheaper inference, but you got worse results on multi-step tasks. The whole Flash premise — intelligence at Flash-level speed and cost — has always been aspirational. With Gemini 3.5 Flash, the benchmarks suggest Google has actually closed a meaningful portion of that gap, particularly for the workload that matters most right now: agentic execution.

How Gemini 3.5 Flash Actually Works

Gemini 3.5 Flash is designed for sub-agent deployment, multi-step workflows, and long-horizon tasks at scale, with particular effectiveness in rapid agentic loops involving complex coding cycles and iterations. That's the framing Google leads with, and the architecture reflects it.

The model supports a 1M token context window, 65k max output tokens, and thinking — the same set of tools and platform features as Gemini 3 Flash. The key architectural addition is thought preservation: the model now maintains intermediate reasoning across multi-turn conversations automatically. When present in the conversation history, reasoning context carries forward, which improves performance on complex multi-step tasks like iterative debugging and code refactoring. No API changes are needed.

The thinking system itself has also changed. The default thinking effort level is now medium, changed from high in Gemini 3 Flash Preview. medium yields very good results across a wide range of tasks while being faster and more cost-efficient. For complex problems, high encourages the model to think more deeply. Google's explicit recommendation: start at medium, drop to low for speed-sensitive agentic loops, escalate to high only for hard reasoning or math. The old thinking_budget numeric parameter is gone — use the thinking_level string enum instead.

One important note for teams running computer-use workloads: Computer Use is not supported in Gemini 3.5 Flash at this moment. For Computer Use workloads, continue using Gemini 3 Flash Preview.

What Developers Are Actually Using It For

The benchmark most worth examining for this audience is MCP Atlas — a multi-step workflows benchmark using MCP. Gemini 3.5 Flash scores 83.6% on MCP Atlas, leading the comparison set that includes Gemini 3.1 Pro (78.2%), Claude Opus 4.7 (79.1%), and GPT-5.5 (75.3%). If you're building anything involving MCP tool chains, that number is directly relevant.

On Finance Agent v2 (financial analysis and decision-making), Gemini 3.5 Flash scores 57.9%, ahead of Claude Sonnet 4.6 (51.0%), Claude Opus 4.7 (51.5%), and GPT-5.5 (51.8%).

The coding story is also compelling in a specific way. JetBrains reports that Gemini 3.5 Flash delivers coding and reasoning quality close to Gemini Pro while preserving the speed and cost profile that makes Flash ideal for real-time developer workflows, with low-reasoning coding performance improved by 10–20% compared to the previous Flash generation.

Enterprise validation comes from Box: Gemini 3.5 Flash beat Gemini 3 Flash by 19.6% on Box's enterprise work evaluation set, which was designed to reflect the kinds of real-world multi-step tasks their customers perform daily. For Life Sciences customers, Gemini 3.5 Flash can extract data and make calculations with 96.4% greater accuracy, and for Financial Services firms, it can build financial reports from structured data with 46.7% greater accuracy.

Why This Is a Bigger Deal Than It Looks

The MCP Atlas score deserves more attention than it's getting. For anyone building agentic systems using the Model Context Protocol — and the infrastructure around it is growing fast — having a model that leads on multi-step MCP workflows at Flash pricing changes the economics of what you can deploy. MCP-native tooling like Glama.ai and other agentic middleware layers become more viable when your inference costs stay low without sacrificing orchestration quality.

The thought preservation feature is the other architectural shift worth watching. Most developers managing multi-turn agentic sessions today are manually engineering state — reconstructing context, summarizing prior steps, managing memory externally. With Gemini 3.5 Flash, the model uses reasoning context from all previous turns when thought signatures are present in the conversation history; the SDKs handle this automatically. That's less scaffolding code your team has to maintain.

There is one behavioral change that could silently degrade quality if you migrate without testing: the default thinking effort changed from high to medium. Teams should verify quality, speed, and cost after migration, and note that thought preservation is now on by default — reasoning context carries forward across turns, which improves performance but may increase token usage.

Availability and Access

Gemini 3.5 Flash is generally available (GA), stable, and ready for scaled production use. The model ID is gemini-3.5-flash, last updated May 2026.

The model is accessible via the Gemini App, Gemini API, Google AI Studio, Google Antigravity, Gemini Enterprise Agent Platform, and Android Studio. It supports function calling, structured output, search grounding, Google Maps grounding, URL context, file search, code execution, and thinking — all available in the same request via combined tool use.

On the paid tier, input pricing runs $1.50 per million tokens and output at $9.00 per million tokens (including thinking tokens). Context caching is $0.15 per million tokens, with storage at $1.00 per million tokens per hour. Batch inference halves those rates. A free tier is available for experimentation through Google AI Studio.

For teams migrating from Gemini 3 Flash Preview: update the model string from gemini-3-flash-preview to gemini-3.5-flash, replace thinking_budget with thinking_level, remove temperature/top_p/top_k from your config (no longer recommended), and add id and matching name to all FunctionResponse parts. The full migration checklist is worth reading before touching production.

The speed-vs-intelligence tradeoff that has defined the Flash tier since its inception is getting smaller with each generation. The MCP Atlas score, the thought preservation architecture, and the enterprise validation from Box all point at the same conclusion: Gemini 3.5 Flash is the most credible case yet that "fast and cheap" doesn't have to mean "less capable" for agentic workloads specifically.

Follow for more coverage on MCP, agentic AI, and AI infrastructure.

Manifold Security Just Scored 7,700 MCP Servers. Here's Why That Number Should Worry You.

Om Shree — Wed, 20 May 2026 09:46:12 +0000

The MCP ecosystem grew faster than anyone could audit it. Now there's a tool trying to catch up — and what it's finding isn't reassuring.

The Problem It's Solving

When Model Context Protocol became the de facto standard for connecting AI agents to external tools and data, adoption moved at a pace the security industry wasn't ready for. Every major agent platform built in MCP support. Registries filled up. Enterprises started wiring agents to internal systems through servers they'd never vetted.

The supply chain problem with traditional software took years to become obvious. With MCP, the same pattern is playing out in months. And the threat model is nastier than a bad npm package.

A compromised MCP server doesn't just exfiltrate data. It can control an agent's reasoning, redirect its execution, and manipulate its decisions at the tool-call layer — before the output ever reaches a human. That's a different category of exposure than a vulnerable dependency. You're not patching a library. You're potentially handing an attacker the steering wheel of an autonomous system.

How Manifest's Scoring Actually Works

Manifold Security has expanded its Manifest supply chain intelligence platform to cover MCP servers, adding scored entries for over 7,700 servers pulled from the official MCP Registry. The platform now indexes more than 206,000 total assets across skills, plugins, browser extensions, and server infrastructure.

Each MCP server gets a composite Manifest Score built from two signal families.

The Lineage Score evaluates publisher provenance: authorship history, community presence, repository activity, and verification signals. This is the "who made this and do they have a track record" question. For most MCP servers, the answer is murky. Unlike agent skills that often link to public repositories with commit history and maintainer context, many MCP servers expose only an HTTP endpoint. There's no source to inspect, no maintainer to look up. Lineage Score is trying to assign a confidence level to something that was never designed to be audited.

The Safety Score does behavioral analysis on the server's declared interface — scanning for contradictions, manipulative instructions, and prompt injection patterns embedded in tool descriptions. This matters because prompt injection through MCP tool definitions is already a documented attack vector. A malicious server can instruct an agent to exfiltrate data or ignore safety constraints through nothing more than a carefully worded tool description.

The combined Manifest Score gives security teams a ranked signal, not a binary pass/fail. That's the right framing — in an ecosystem this young, a clean score is a confidence indicator, not a guarantee.

What Security Teams Are Actually Using It For

The use case is straightforward: before an enterprise allows employees to connect an agent to an MCP server, someone needs to have looked at it. Right now, almost nobody has a formal process for that. Manifest is trying to be the equivalent of a CVE database for this layer of the stack.

The backstory on why this is urgent comes from Manifold's own threat research. An empirical study analyzed nearly 100,000 agent skills across two major registries and found 157 behaviorally confirmed as malicious. Those weren't fringe edge cases — each malicious skill averaged over four distinct vulnerabilities across multiple kill chain phases. The attack archetypes the researchers identified broke into two categories: Data Thieves that exfiltrate credentials through supply chain techniques, and Agent Hijackers that subvert agent decision-making through instruction manipulation.

On ClawHub, the OpenClaw marketplace, Antiy CERT confirmed over 1,100 malicious skills — roughly one in twelve packages. In March 2026, researchers demonstrated a ranking-manipulation attack that pushed a malicious skill to the top of its category by exploiting an unprotected API endpoint; it executed across more than 50 cities in six days, quietly exfiltrating identity data from installations inside several public companies.

MCP servers face the same threat surface, with less visibility.

Why This Is a Bigger Deal Than It Looks

The signal-to-noise problem in AI agent security is already bad. Skill scanners proliferated after the first wave of malicious packages — LLM-based classifiers, static analyzers, behavioral sandboxes — and they routinely disagree with each other. Manifold's bet is that the right approach is composite scoring across provenance and behavioral signals together, rather than analyzing components in isolation.

That bet is defensible. Provenance alone misses injected behavior. Behavioral analysis alone misses trust chain problems where a legitimate-seeming server was silently modified or taken over. The combination — Lineage plus Safety — is closer to how you'd actually want to evaluate a third-party component before wiring an autonomous agent to it.

The harder structural problem is that the MCP ecosystem wasn't designed with auditability in mind. HTTP endpoints with no associated repository are normal. Publishers with no community footprint are common. Manifold is trying to build a trust signal layer on top of infrastructure that never anticipated needing one. That's not a criticism of the tool — it's the accurate description of the problem the tool exists to solve.

Manifold Security CEO and co-founder Neal Swaelens put it directly: "Every developer today has coding agents on their laptop with access to source code, production systems, and CI/CD pipelines connected to an expanding ecosystem of MCP servers, skills, and third-party tools that no one is inspecting."

Availability and Access

Manifest is available now as a free, open-access platform. The MCP server index — 7,700 scored entries and growing — is searchable alongside the existing database of skills and plugins. Enterprise tiers extend coverage into Manifold's broader AIDR platform, which provides runtime visibility into agent behavior, live MCP server connections, privilege paths, and anomaly detection. Manifold raised an $8 million seed round in March 2026 led by Costanoa Ventures.

The MCP supply chain is the new npm — except agents don't just run code, they make decisions. Scoring 7,700 servers is a start. The question is whether enterprises adopt a review process before the next ranking-manipulation attack makes the choice for them.

Follow for more coverage on MCP, agentic AI, and AI infrastructure.

Freshworks Just Shipped an MCP Gateway Inside Its ITSM Platform. Here's What That Actually Changes.

Om Shree — Wed, 20 May 2026 09:43:29 +0000

Enterprise ITSM has always been a walled garden — every tool talking to nothing, every workflow requiring a custom integration ticket. Freshworks just put an MCP Gateway inside Freshservice and called it the antidote.

The Problem It's Solving

IT support has a ghost shift problem. Freshworks pulled telemetry from millions of service interactions and found that 47% of all IT tickets now come in outside standard business hours. Response times in that window run more than an hour slower, with SLA rates dropping as much as 5%. The workforce is distributed and always-on. The service desk isn't.

The deeper issue is architectural. Enterprise ITSM platforms — ServiceNow, Jira Service Management, and yes, older versions of Freshservice — were built as centralized systems of record. When AI started getting bolted on, it ran into the same problem every new layer runs into: it had no real access to the live context sitting across your HR system, your project tools, your incident logs. The AI was smart in isolation and blind in practice.

Freshworks is betting that the gap isn't a model problem. It's a context problem.

How Freddy AI Agent Studio Actually Works

The centerpiece announcement from Freshworks' Refresh 2026 conference is Freddy AI Agent Studio — a no-code environment for building and deploying custom AI agents inside Freshservice. But the more technically interesting piece is what sits underneath it: an MCP Gateway.

Model Context Protocol, for context, is the emerging standard for letting AI agents pull live data from external systems without custom integration code. Freshworks has implemented it as a native layer in Freddy AI, which means agents can now reach into Notion, ClickUp, Linear, Workday, Rippling, and the rest of the enterprise stack — not through brittle webhooks or bespoke connectors, but through a standardized protocol call.

The practical result: a Freddy AI agent handling an employee onboarding request can pull HR data from Workday, create a task in ClickUp, and update a Notion doc — all inside a single workflow, without an engineer writing glue code for each handoff.

On top of that, the studio ships with pre-built domain-specific agents for IT, HR, finance, and facilities, plus a library of agentic workflow templates. Agents meet employees where they already are — Microsoft Teams and Slack — rather than requiring portal logins.

The measurement layer is called AI Insights, paired with Experience Level Agreements (xLAs). The framing is: stop tracking ticket close times, start tracking whether employees actually got their problems solved. The xLA system uses weighted computation and AI analysis to connect service delivery metrics directly to employee sentiment scores.

Freshservice's unified data layer — which now includes the reimagined Freshservice ITAM and FireHydrant incident management products — is what gives the agents clean, reliable context to work with. The pitch is that Freddy doesn't need a months-long data cleanup project before it can run. The foundation is supposed to be ready on day one.

What IT and Service Teams Are Actually Using It For

The announced use cases cluster around two areas: employee self-service at scale, and cross-departmental workflow automation.

On the self-service side, the ghost shift problem is the clearest target. An employee submitting a payroll question at 11pm shouldn't wait until 9am for a human to look at it. A Freddy AI agent with access to Rippling and Workday can resolve that class of request without any queue time.

For cross-departmental automation, the MCP Gateway is doing the work that would previously have required a dedicated integration project. New hire onboarding — which typically touches IT provisioning, HR systems, facilities access, and project management — is the flagship example. The agent orchestrates across all of them through a single workflow definition.

Amerisure's IT Service Management team offered a concrete data point: ticket trend analysis that used to take an hour each morning now takes three minutes with Freddy Insights. That's the kind of mundane-but-real efficiency number that actually lands in a budget conversation.

Why This Is a Bigger Deal Than It Looks

MCP is moving fast as an enterprise standard, but most of what's been announced so far has lived at the developer tooling layer — IDEs, coding agents, local model setups. Freshworks embedding MCP as a production capability inside an ITSM platform used by companies like Bridgestone, New Balance, and S&P Global is a different category of deployment.

It's the first time MCP has been packaged as a no-code enterprise feature for IT ops teams who will never touch a config file. That changes who can deploy AI agents with live cross-system context — from platform engineers to service desk managers.

The governance angle matters too. The announcement specifically calls out "embedded governance" and deployment in "weeks, not quarters" as differentiators from legacy platforms. That's positioning against ServiceNow, which has its own agentic AI story but carries the implementation complexity that comes with it. If Freddy AI Agent Studio actually delivers on that timeline claim, the competitive pressure on the ITSM incumbents gets real.

Keith Kirkpatrick at The Futurum Group put it clearly: the market is shifting from AI pilots to production deployments, and the platforms that combine integration breadth, deployment speed, and governance tooling in one package are the ones that will win the next wave of enterprise deals.

Availability and Access

Freddy AI Agent Studio and the MCP Gateway are available now as part of Freshservice. The FireHydrant incident management integration and the reimagined ITAM module are included in the unified platform. Freshworks published a Futurum Group report showing 168% ROI over three years for enterprises moving off legacy ITSM platforms, available on their site. More detail on the May launch is at freshworks.com.

MCP just moved from developer infrastructure into enterprise service operations. The question now is how fast the other ITSM platforms respond — and whether ServiceNow's complexity becomes the thing that costs it the mid-market.

Follow for more coverage on MCP, agentic AI, and AI infrastructure.

Microsoft Just Published a Blueprint for Self-Healing CI/CD. Here's What the Observe-Analyze-Act Loop Actually Does.

Om Shree — Wed, 20 May 2026 01:58:13 +0000

Pipeline failures are one of those things every engineering team accepts as friction they can't eliminate — something breaks at 2am, someone gets paged, someone debugs, someone fixes. Microsoft just published a working architecture that removes humans from that first-response loop entirely.

The Problem It's Solving

Standard CI/CD pipelines fail, send you a stack trace, and wait. One small typo in a backend pool member IP can tank a deployment. The debugging cycle is manual by design: read the logs, understand the context, figure out what broke, push a fix, re-run. For teams migrating infrastructure — legacy load balancer settings to Azure ILB rules, for instance — that cycle can eat days.

The self-healing pipeline architecture Microsoft outlined on the Azure Infrastructure Blog replaces that cycle with an agentic loop. The pipeline still fails. But instead of waiting for a human to read the error, an AI agent reads it, understands it in infrastructure context, and proposes (or executes) a fix.

How It Actually Works

The self-healing workflow is an agentic loop consisting of three phases: Observe, Analyze, and Act. The process begins with an event-driven trigger. When an Azure DevOps pipeline fails, a webhook sends the telemetry and build logs to an Azure Function. The logs are then passed to GPT-4o via the Microsoft AI Foundry endpoint.

That last part is the hinge. The model doesn't just look for error codes — it understands the infrastructure context. There's a meaningful difference between a regex that matches "connection refused" and a model that can reason about why a backend pool misconfiguration would produce that error given the surrounding deployment context.

The implementation uses Azure AI Foundry's ChatCompletionsClient to call GPT-4o with a system prompt that frames it as an autonomous DevOps assistant. The agent receives the raw error logs, analyzes them, and returns a proposed fix. That fix can then trigger a GitHub pull request or an Azure DevOps pipeline update automatically — the "Act" phase closing the loop.

Microsoft AI Foundry provides a standardized way to call Azure OpenAI, which matters for teams that want consistent API surface across environments rather than managing direct OpenAI endpoint configurations per service.

On why GPT-4o specifically: native tool use makes it specifically optimized for function calling, allowing the agent to interact with Azure DevOps APIs and GitHub seamlessly. As a first-party service, Azure OpenAI is also the most cost-effective path to running production-grade agents, and GPT-4o processes complex logs in seconds, identifying errors much faster than a human scanning line by line.

What Teams Are Actually Using This For

The Microsoft post describes a concrete infrastructure migration scenario: mapping legacy load balancer settings — like fastest-app-response or source-address persistence — to Azure ILB rules, where a single typo in backend pool member IPs can tank a deployment.

The agent now scans those configs before the pipeline runs, flags mismatches, and suggests the correct Azure-native equivalent. It's saved the team days of trial-and-error debugging. That's the pre-failure application — catching configuration drift before it becomes a deployment failure, not just responding after.

Post-failure, the loop handles anything where the fix is diagnosable from logs alone: dependency mismatches, misconfigured environment variables, failed health checks on newly deployed resources. The agent reads the failure telemetry, identifies the category of error, and proposes a remediation that goes straight to a PR for review — or executes directly, depending on how the "Act" phase is configured.

This connects to a broader pattern Microsoft's platform engineering teams have been documenting. When a deployment degrades, Argo CD fires a webhook to GitHub Actions, which creates a GitHub issue with the failure details — cluster name, resource group, the initial telemetry. The agent reads the issue, authenticates to Azure via Workload Identity Federation, runs kubectl commands against the affected cluster, and queries the AKS MCP server for deeper telemetry. The self-healing CI/CD architecture is the Azure DevOps-native version of the same idea.

Why This Is a Bigger Deal Than It Looks

The architecture itself isn't complex — webhook, Azure Function, GPT-4o call, conditional action. What's significant is that it's now a documented, first-party pattern from Microsoft's Azure Infrastructure team, with a real use case attached. That's different from a proof-of-concept.

AI agents don't magically fix broken engineering practices — they scale them. If your CI/CD pipelines are fragile, agents will break them faster. If your test coverage is thin, agents will ship untested code at higher velocity. The self-healing architecture assumes your pipeline failures are diagnosable from logs. For teams with well-structured observability, that's most failures. For teams without it, this pattern will surface the gaps fast.

There's also a shift in how pipeline failures are categorized. Traditional CI/CD pipelines rely on binary assertions — Assert X == Y. But AI agents are probabilistic. The self-healing loop works well on the deterministic failure surface — config errors, missing dependencies, mismatched API parameters. The harder problem, testing and validating the agent's own proposed fixes before they ship, is where the architecture gets more complex. For now, the PR-as-output model keeps a human in the loop on the final action, which is the right call for production systems.

By shifting the burden of initial troubleshooting to automated agents, teams aren't just saving time — they're increasing the reliability of their entire stack. That framing is accurate, but the reliability gain depends entirely on how the "Act" phase is scoped. Agents that open PRs are recoverable. Agents with direct write access to production pipelines require more careful guardrails before you'd want them running unsupervised.

Availability and Access

The pattern runs on Azure DevOps, Azure Functions, and Azure OpenAI via AI Foundry. No preview program required — these are all generally available services. The full implementation walkthrough, including the ChatCompletionsClient setup and the webhook-to-function wiring, is in the Microsoft Tech Community post.

The architecture is modular enough to adapt: swap Azure Functions for any serverless compute that can receive a webhook, swap GPT-4o for any model with strong function-calling support, and scope the "Act" phase to whatever your organization's change management policy allows.

The pipeline-as-passive-executor era is ending. Pipelines that can read their own failures, reason about them, and act on them are the next default — and Microsoft just made the blueprint public.

Follow for more coverage on MCP, agentic AI, and AI infrastructure.

Google Just Rebuilt Its Enterprise AI Stack at I/O '26. Here's What Gemini 3.5, Spark, and Antigravity Actually Do.

Om Shree — Wed, 20 May 2026 01:55:14 +0000

Google I/O '26 dropped today, and for the first time in a while, the enterprise announcements are the ones worth paying attention to. Not because of model benchmarks — though those are interesting — but because Google just shipped an integrated agentic stack that reaches from the model layer all the way down to the individual worker's inbox.

The Problem It's Solving

Enterprise AI has had a deployment problem. Most organizations have access to capable models, but the path from "we have Gemini" to "our teams are actually running less manual work" has involved a lot of custom integration, fragile automations, and agents that can't see across tools. What Google is trying to do with this I/O release is close that gap — ship the plumbing, not just the model.

The announcement covers five distinct products: Gemini 3.5, Gemini Omni, Google Antigravity, Gemini Spark, and a Managed Agents API on Agent Platform. Each one sits at a different layer of the stack.

How It Actually Works

Start with the model. Gemini 3.5 Flash is the new baseline — Google's claim is that it rivals larger flagship models while staying within Flash's speed and cost profile. The numbers they're citing: 76.2% on Terminal-Bench 2.1, 83.6% on MCP Atlas, and 84.2% on CharXiv for multimodal understanding. Gemini 3.5 Pro is in testing and coming next month.

That MCP Atlas benchmark is worth noting specifically. Google scored Gemini 3.5 Flash against a benchmark designed around Model Context Protocol task completion — the same protocol that's become the de facto standard for tool-using agents across the industry. Getting 83.6% there isn't just a number; it's a signal about where Google thinks the evaluation bar for agentic models should be.

Gemini Omni is the video-first model — takes text, audio, image, and video inputs and produces dynamic video output. Think post-production automation, e-commerce virtual try-ons, content localization. It's rolling out in the coming weeks via the Gemini API.

Antigravity 2.0 is where things get more interesting for developers. It's a standalone desktop app and now integrates with Agent Platform, meaning it inherits Google Cloud's data privacy protections by default. There's also an Antigravity CLI for teams that want a lighter-weight interface. The pitch from AirAsia Next's CTO: over half of their production-ready code now comes through Antigravity agentic workflows. That's a real number from a shipping company, not a demo.

Gemini Spark is the personal agent layer. It runs 24/7 in the background, connects to Workspace plus external connectors like Salesforce, Zendesk, ServiceNow, and SharePoint, and can take multi-step actions autonomously — with approval gates for anything high-risk. Every task runs in an ephemeral VM, credentials never touch the agent directly, and all traffic routes through an Agent Gateway that enforces DLP policies. The isolation story is more specific than most personal agent announcements tend to be.

The Managed Agents API lets developers spin up custom agents via a single API call, running in Google-hosted environments. No infrastructure to manage; governance and security inherit from Agent Platform automatically.

And there's CodeMender — an AI security agent from Google DeepMind, now integrated into Agent Platform. It finds vulnerabilities, proposes patches, tests them, and can apply fixes across dependent systems with developer approval.

What Developers and Enterprises Are Actually Using This For

The use cases Google is demonstrating are specific enough to be useful as a map.

For IT operations: Spark monitors ServiceNow, detects recurring incidents, creates escalated Jira tickets, drafts incident reports, and pings the right manager for approval before sending anything externally.

For sales: Spark pulls account history from Salesforce, cross-references support tickets from Zendesk, identifies churn signals, and drafts a retention strategy — sitting in draft until the salesperson approves it.

For product launches: Antigravity 2.0 handles simultaneous agent-driven execution across code generation, asset creation, and customer email drafts, all orchestrated from a single workspace.

For security: CodeMender audits codebases, recommends patches, and deploys them with human sign-off. This is particularly relevant for teams carrying compliance obligations where every change needs an audit trail.

Why This Is a Bigger Deal Than It Looks

The piece that matters most here isn't any single product — it's that Google is shipping an end-to-end agentic stack with enterprise data controls built in from the start, not bolted on.

Most enterprise AI deployments today involve stitching together a model API, a separate orchestration layer, custom connector work, and some homegrown governance layer. Google is trying to collapse that into a single platform surface where the governance, security, and agent behavior are codesigned. The Managed Agents API making Agent Platform's data protections automatic is a specific example of what that looks like in practice.

The MCP Atlas benchmark score is also a tell. Scoring Gemini 3.5 Flash against an MCP-specific benchmark is an implicit endorsement of MCP as the standard evaluation surface for agentic capability — significant given how much momentum MCP has built across the industry since Google Cloud Next '26.

Availability and Access

Gemini 3.5 Flash is live today in Gemini Enterprise, Google AI Studio, and Antigravity. Gemini Omni Flash comes in the next few weeks. Gemini Spark in the Gemini Enterprise app is rolling out soon; Workspace preview for business customers follows. Antigravity in Gemini Enterprise arrives in the coming months. Managed Agents API documentation is live at docs.cloud.google.com.

Gemini 3.5 Pro remains in testing, expected next month.

The shift from AI-assisted work to AI-executed work — with humans approving rather than doing — is the actual direction this points. Google's bet at I/O '26 is that enterprises will adopt that model faster if the security and governance story is tight from day one, not something they have to build themselves.

Follow for more coverage on MCP, agentic AI, and AI infrastructure.

Gemma 4 Didn't Just Get Smarter. It Became a Different Kind of Model. Here's What the Agentic Numbers Actually Mean.

Om Shree — Wed, 20 May 2026 01:18:22 +0000

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

Every open-weight model release in 2026 comes with a benchmark table and a claim about efficiency. Most of them are incremental. Gemma 4 has one number that isn't: 6.6% to 86.4% on agentic tool use. That's not an improvement. That's a category change.

The Number That Actually Matters

When Google DeepMind dropped Gemma 4 on April 2, 2026, the coverage focused on the headline scores - AIME 2026, LiveCodeBench, Arena AI rankings. Those numbers are impressive. The 31B dense model scores 89.2% on AIME (up from Gemma 3 27B's 20.8%), 80% on LiveCodeBench (up from 29.1%), and sits third among all open models on Arena AI.

But the benchmark that actually changes what developers can build is τ2-bench - the agentic tool use evaluation that measures whether a model can reliably execute multi-step tasks across real tool schemas, partial information, and policy constraints. Gemma 3 27B scored 6.6% on τ2-bench Retail. Gemma 4 31B scores 86.4%.

Put that concretely: Gemma 3 failed 93 times out of 100 on structured tool use. Gemma 4 fails roughly 14 times out of 100. Those aren't the same class of model for anyone building agents.

The 26B MoE variant scores 85.5% on the same benchmark while activating only 3.8 billion of its 26 billion parameters per forward pass. You get near-flagship agentic capability at a fraction of the inference cost.

What Changed Architecturally

The τ2-bench jump didn't happen because Google made a bigger model. Gemma 4 31B has roughly the same parameter count as Gemma 3 27B. What changed is how the model was trained and what capabilities were baked in natively.

Gemma 4 ships with native function calling via dedicated control tokens - structured tool use is built into the model's vocabulary rather than bolted on through prompt engineering. It has configurable thinking modes where the model can generate 4,000+ tokens of step-by-step reasoning before committing to a tool call, which directly improves accuracy on complex multi-step pipelines. And it has native system prompt support, meaning you can define agent behavior, tool schemas, and constraints in the system turn without workarounds.

The architecture also came from the same research stack as Gemini 3, Google's closed frontier family. The knowledge transfer is visible in the benchmark gaps - particularly on tasks requiring multi-turn planning and policy-compliant tool execution, which are exactly the conditions τ2-bench tests.

One important hardware caveat on the 26B MoE: while it activates only 3.8B parameters per token during generation, all 26 billion parameters must be loaded into memory for routing. Its memory footprint is close to a dense 26B model, not a 4B one. The speed advantage is real - the MoE reaches 40+ tokens per second on consumer GPUs versus 10+ for the dense 31B - but size your VRAM accordingly before assuming it runs like a small model.

Why This Matters for Developers Building Agents

Before Gemma 4, the honest answer to "should I use a local open model for my agent?" was usually no - at least not for anything where tool call reliability mattered. A 6.6% success rate on structured tool use means the agent fails almost every time it needs to call a function, check a schema, or chain tool outputs. That's not a foundation for anything in production.

86.4% changes the calculation. It's not at parity with frontier closed models - GPT-5.4 still leads on complex multi-step benchmarks - but it's in the range where developers can build real agentic workflows locally, catch edge cases with retries and error handling, and ship something that actually works. The failure modes are now manageable rather than fundamental.

This matters especially for three deployment contexts that couldn't practically use local models before.

Privacy-sensitive agentic applications. Healthcare tools, legal review pipelines, financial compliance agents - any workflow where raw query data can't leave the device. Gemma 4's native function calling running locally means the model decides which tool to call on-device, and only the structured API request goes out over the network. Your prompt, your context, and your intermediate reasoning stay local.

Cost-controlled production agents. Per-token API costs accumulate fast in multi-step agentic workflows where each task triggers 5–20 tool calls. Running Gemma 4 26B MoE locally on a consumer GPU eliminates that variable entirely. The 26B MoE's inference speed (40+ tokens/sec on an RTX 4090) is fast enough for real-time agentic loops without the latency penalty you'd expect from a model this capable.

MCP-integrated local pipelines. Gemma 4's native function calling maps directly to Model Context Protocol tool schemas. The setup is straightforward: run Gemma 4 via llama.cpp or vLLM with an OpenAI-compatible endpoint, point your MCP client at it, and the model handles tool selection and call generation locally. What previously required a cloud model API can now run on your own infrastructure with no per-call cost and no data leaving your server.

Picking the Right Model for Agentic Work

Gemma 4 ships as a family of four, and the right choice for agentic deployment isn't automatically the biggest one.

The 31B dense model is the accuracy ceiling - highest τ2-bench score, best reasoning on complex multi-step tasks, strongest fine-tuning base. It runs unquantized on a single 80GB H100, and quantized (Q4_K_M) on consumer GPUs with 24GB+ VRAM. If you're building a server-side agent where quality is the constraint and hardware isn't, start here.

The 26B MoE is the practical production choice for most agentic deployments. 85.5% on τ2-bench is close enough to the 31B that the tradeoff is almost always worth it: 4x faster token generation, lower GPU memory pressure during inference, same 256K context window. For agents running continuous loops or handling high request volume, the speed difference compounds significantly.

The E4B (4B edge model) hits 52% on LiveCodeBench and supports native audio input - the only model in the family that handles speech natively. If you're building on-device Android agents that need voice input or mobile-first agentic workflows, this is your model. The agentic tool use scores are lower, but the hardware targets are completely different: this runs on a phone.

The E2B (2B edge model) reaches 133 prefill tokens/sec on a Raspberry Pi 5 CPU. For IoT agents, offline-first deployments, or anything constrained to sub-1.5GB RAM, it's the only viable option in this family and still handles multimodal input.

The Apache 2.0 License Is Not a Minor Detail

Every previous Gemma release shipped under a Google proprietary license. Gemma 4 is the first under Apache 2.0.

For agentic AI specifically, this matters more than it does for general language model use. Agents get embedded in products. They get fine-tuned on proprietary data. They get wrapped in commercial services that customers pay for. All of that required legal review and negotiation under the old Gemma license. Under Apache 2.0, you can build, ship, fine-tune, and commercialize without clearing Google's terms first.

For startups and solo developers building on open-weight models, this is one less legal headache at exactly the moment when the model became capable enough to actually deploy in production.

Getting Started

# Pull with Ollama - fastest path to a running model
ollama pull gemma4:31b
ollama pull gemma4:26b-moe

# Or via Hugging Face
pip install transformers

Google AI Studio has the 31B and 26B MoE available in-browser with no local setup. Google AI Edge Gallery covers the E4B and E2B for on-device testing. Full framework support at launch includes Hugging Face Transformers, vLLM, llama.cpp, MLX, NVIDIA NIM, SGLang, Ollama, LM Studio, and more.

For MCP integration, the gemma-mcp package handles client setup against a locally-served Gemma 4 endpoint.

One practical note if you're running the 26B MoE via Ollama on Apple Silicon: as of v0.20.3 there's a known streaming bug that routes tool-call responses to the wrong field. Use llama.cpp directly or wait for the Ollama fix before deploying in an agentic context.

The Honest Caveat

86.4% on τ2-bench Retail is not 100%. In agentic pipelines where tool calls chain across 10–20 steps, a 14% per-call failure rate compounds. Production deployments need retry logic, error handling, and validation layers between tool outputs - the same engineering discipline you'd apply to any distributed system with failure modes.

Gemma 4 doesn't eliminate the need for defensive agent architecture. It makes the failure rate manageable enough that the architecture is worth building.

That's the real shift. Not that local open models are now perfect for agentic work. It's that they crossed the threshold from "interesting experiment" to "defensible production choice" - and they did it on your hardware, under a license you can actually ship with.

Follow for more coverage on MCP, agentic AI, and AI infrastructure.

Hermes Agent's Learning Loop Is the Only Thing That Makes an Agent Actually Get Better. Here's How It Works

Om Shree — Wed, 20 May 2026 01:09:56 +0000

This is a submission for the Hermes Agent Challenge

Most AI agents have a memory problem they don't admit to. Every session ends, the context resets, and tomorrow you're explaining your codebase, your preferences, and your constraints from scratch again. Hermes Agent by Nous Research is the first open-source agent that structurally solves this - not through a configurable memory feature, but through a closed learning loop baked into the agent runtime itself.

Why Every Other Agent Forgets

The standard agentic loop is three steps: receive task, plan and execute, return result. State resets. The next task starts blank.

Most frameworks tried to patch this with long-term memory bolted on after the fact - a vector database that stores embeddings of past conversations. The problem is that vector retrieval answers the question "what did we talk about that was similar to this?" It doesn't answer "how did I actually solve this class of problem last time, and what were the exact steps?" Those are different questions, and conflating them is why most "memory-enabled" agents still feel stateless in practice.

Hermes Agent adds two steps after the response is returned. Step four: the agent receives an internal nudge to evaluate whether the session is worth persisting. Step five: if the task involved five or more tool calls, the agent autonomously writes a skill document describing exactly how it was solved, then indexes it into memory for every future session. That's the loop. And it's the reason the project crossed 100,000 GitHub stars seven weeks after launching on February 25, 2026.

The Five Stages in Practice

Understanding the loop means understanding what actually happens between "you send a message" and "the agent responds."

A message arrives - from CLI, Telegram, Discord, Slack, WhatsApp, Signal, or a scheduled cron job. They all enter the same execution engine. Before the model sees your query, the agent runs retrieval: it queries a local SQLite database with FTS5 full-text search, pulling relevant past skills and notes at roughly 10ms latency across 10,000+ indexed documents. The model then plans, invokes tools, executes, and streams output - that's the ordinary agent loop you know.

After the response, the loop diverges. The agent checks its own session. Did this involve meaningful tool sequences? Is there a generalizable procedure here? If yes, a skill document gets written to ~/.hermes/skills/ in plain Markdown following the agentskills.io open standard. That file is immediately searchable by every future session. The next time a similar problem arrives, Hermes retrieves the procedure rather than rediscovering it.

The practical result: independent benchmarks show agents carrying 20+ self-created skills complete similar future research tasks roughly 40% faster than fresh agent instances on the same job. The honest caveat is domain specificity - a skill learned from summarizing GitHub PRs doesn't transfer to planning database migrations. Cross-domain generalization is still unsolved. But within a narrow, repetitive domain, the compounding effect is real and measurable.

Four Memory Layers, Each Solving a Different Problem

The learning loop is the process. The memory system is what it writes into, and it's split across four distinct layers.

Session memory is ordinary context management - the current conversation window. Nothing novel, but Hermes exposes /compress, /usage, and /insights slash commands so you can monitor and control it explicitly rather than waiting for silent overflow.

Persistent memory is the SQLite FTS5 store where completed task outcomes and agent-curated notes live. Everything sits in ~/.hermes/ on your own machine - no cloud round-trips, no telemetry, no third-party memory provider. The architecture scales comfortably to around 100K documents before you'd want to swap in a dedicated vector store like Qdrant or Chroma.

The skill document store is the output of the learning loop. Skills are plain Markdown files - portable, human-readable, diff-able in version control. Crucially, only skill names and brief descriptions load into the system prompt by default. Full skill bodies load on demand. That design is why a library of 200 skills doesn't blow your context budget. As of v0.10.0, Hermes ships 96 bundled skills plus 22 optional ones across 26+ categories covering MLOps, GitHub workflows, research pipelines, scraping, code execution, and more.

Honcho is the optional fourth layer - a user modeling system built via integration with Plastic Labs' dialectic architecture. Honcho passively accumulates your preferences, communication style, tech stack, and domain vocabulary across sessions. It's the layer that gives the "grows with you" quality after several hundred interactions. For task-specific deployments, the other three layers are usually sufficient.

One trade-off worth naming: the memory system is automatic but not fully transparent. You can't export "everything Hermes knows about me" as a single human-readable file. If you're operating under GDPR, HIPAA, or CMMC constraints, factor that into your deployment decision.

Skills Are the Interface Between Learning and Utility

A skill in Hermes terms is a Markdown document describing how to accomplish a specific procedure - which tools to invoke, in what order, with what parameters, and what pitfalls to avoid. Two kinds coexist: the bundled catalog that ships with every install (curated and security-reviewed by Nous Research), and auto-created skills generated by the learning loop itself.

Because skills follow the agentskills.io open standard, they're not locked to Hermes. The same file can run inside any framework that implements the spec. As of mid-April, the community hub was carrying 643 reviewed skills - smaller than OpenClaw's 13,000+ marketplace, but curated in a way that sprawling open marketplaces typically aren't.

One practical gotcha: auto-generated skills from moderate tasks (5–10 tool calls) tend to be tight and reusable. Skills generated from very complex multi-phase tasks (50+ tool calls) sometimes over-generalize or bake in too much session-specific context. A manual review pass of auto-generated skills during your first month of use is time well spent.

Why This Architecture Actually Matters

The agent space in 2025 and early 2026 was dominated by a certain kind of demo: impressive one-shot task execution, elegant tool orchestration, clean architecture diagrams. What almost nobody shipped was an agent that got measurably better at your specific workflows the longer it ran.

Hermes Agent's learning loop is a structural bet that agents are most valuable not as general-purpose task executors but as accumulating specialists. If your workflows are repetitive and structured - running the same class of tasks against the same codebase over months - Hermes compounds in ways that prompt-engineered agents simply cannot match. If your workflows are broad and constantly different, the loop has nothing to work with, and the skill library stays thin.

Know which category you're in before architecting around this. The self-improving agent is a compelling abstraction, but it earns its value through repetition. A month of daily use inside a narrow domain will teach you more about whether this architecture fits your work than any benchmark.

There's also a research angle that doesn't get enough coverage. Nous Research built Atropos RL environment integration and trajectory export directly into Hermes. Every run, every successful tool sequence, every generated skill is a candidate trajectory for fine-tuning smaller, purpose-built models. Hermes isn't just an application - it's a data pipeline for the next generation of tool-calling models, built by the lab that trains them. That dual-use architecture is rare, and it's worth understanding if you're thinking about this space beyond the immediate "build an agent" use case.

Getting Started

# Install on Linux / macOS / WSL2
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash

# Set your model provider
hermes model

# Start your first session
hermes

Full documentation at hermes-agent.nousresearch.com/docs. The quickstart gets you to a running agent in under five minutes.

The Bigger Question

The open-source agent field is still mostly asking "can the agent do this task?" Hermes Agent is asking a different question: "does the agent get better at this task over time?" Those are not the same question, and the second one is harder.

Whether the learning loop delivers compounding improvement at the architectural level - not just better UX - is something the research community is still working out. The hermes-agent-self-evolution companion project applies DSPy and GEPA to optimize skills and prompts against benchmarks. If that feedback loop produces measurable improvement on public evals, the "self-improving" framing holds. If gains plateau after a few iterations, the learning loop is a better developer experience - not a better algorithm. Either way, it's the most honest attempt at the problem anyone has shipped in the open.

Every other agent forgets. That's still the baseline. Hermes is trying to make the baseline obsolete.

Follow for more coverage on MCP, agentic AI, and AI infrastructure.

Google AI Edge Gallery Now Runs MCP On-Device. The Privacy Architecture

Om Shree — Wed, 20 May 2026 00:44:23 +0000

This is a submission for the Google I/O Writing Challenge

On-device AI has spent most of its existence being impressive in demos and limited everywhere else. Google just changed the constraint that mattered most: the model couldn't reach anything outside the app sandbox.

The Problem It's Solving

Local inference is great for privacy and latency. It's lousy for usefulness. A model running entirely on your phone can answer questions from its training data and nothing else — no calendar, no inbox, no live web, no external tools. You get an isolated reasoning engine that can't act on the world around it.

That's the fundamental tension in edge AI: the moment you connect a model to external systems, you typically route the requests through a server. The privacy story falls apart. The latency goes up. The offline capability disappears.

Google AI Edge Gallery just shipped an answer to this. The May 19 update adds Model Context Protocol (MCP) support to the Android app, alongside scheduled notification reminders and persistent chat history. Together, these three features move the app from a model playground into something that starts to look like an actual on-device agent runtime.

How It Actually Works

The MCP integration runs over Streamable HTTP, currently experimental and Android-only (iOS support is coming). The architecture is worth understanding carefully, because it's not what you might expect.

When you register an MCP server URL in the app, it dynamically pulls tool definitions and resource schemas directly into Gemma 4's system prompt on-device. The reasoning happens entirely on the phone. Gemma 4 decides locally which tool to call, generates the request locally, and then sends that request to wherever the MCP server lives — your home computer, a cloud endpoint, wherever. The model itself never leaves the device.

This is a meaningful architectural choice. The tool selection and orchestration logic stays private. Only the structured API call goes out over the network, not your raw query or whatever context the model was working with.

The notification system works differently: it's a "Schedule Notification" skill that sets local OS-level reminders. When you tap one, the app opens directly to the right tool and launches a Gemma 4 session automatically. No server involved at all.

Chat history persistence runs through the LiteRT-LM backend's fast prefill capability. On modern phone GPUs, prefill can hit over 3,000 tokens per second, which means the model can reconstruct a long previous session almost instantly when you reopen the app. Sessions maintain state across text, images, and audio.

What Developers Are Actually Using It For

The MCP use cases Google demos are practical rather than speculative. Connect to a Google Workspace MCP to query your calendar or check your inbox. Use a Google Maps MCP to ask about travel times in natural language. Connect a web fetch MCP to pull live documentation or news into the model's context.

The notification + session continuity combination opens up something more interesting: scheduled routines that actually maintain context. A mood tracking workflow that reminds you every evening at 10 PM, opens to Gemma 4, and — because chat history persists — can look back at previous entries to surface trends. A morning briefing that reads your local calendar and gives you a summary before you leave the house. A daily "learn something new" prompt that generates a shareable visual infographic from whatever topic you pick.

The community-built skills on the GitHub Discussions page are already going further: lightweight web search integrations for live weather and currency data, parsers that turn images and HTML into structured data for semantic search, quiz generators, language translators, offline puzzle games.

Google has also added the ability to edit the system prompt directly from chat settings, which is the right call for a developer-facing app. You can define personas, set output constraints, or experiment with prompting approaches without touching any config files.

One practical note for anyone building on this: on-device models have smaller context windows than their server-side counterparts. Google explicitly recommends keeping MCP tool descriptions short and returning bite-sized data snippets rather than long text blocks. The architecture rewards lean, well-scoped tool definitions.

Why This Is a Bigger Deal Than It Looks

MCP has spent most of 2025 and early 2026 as an enterprise and desktop story. The tooling, the infrastructure, the conversation — it's been aimed at developers building server-side agents with access to large context windows and cloud compute.

Putting MCP into a phone app, powered by a model running entirely on-device, moves the protocol into a different category of deployment. The reasoning stays on the device. Only structured tool calls go out over the network. That's a viable architecture for healthcare apps, legal tools, or anything else where raw query data can't leave the device.

There's also something worth noting about the open-source angle here. The Google AI Edge Gallery repository is public, the skill system is extensible, and the community is already building on it. This isn't a closed platform with a curated app store of approved integrations. Anyone can write an MCP server, register it in the app, and extend what on-device Gemma can reach.

The combination of persistent sessions, proactive notifications, and external tool access is basically the minimum viable definition of an ambient agent: something that maintains context over time, reaches external systems when needed, and can act without being explicitly invoked. Google shipped all three in one update.

Availability and Access

The MCP integration is live now in the Android version of Google AI Edge Gallery. iOS support is listed as coming soon. Technical documentation and example MCP configurations are in the GitHub repository. The app is free on both the Play Store and App Store.

The edge AI stack — Gemma 4 running locally, MCP bridging to external tools, LiteRT-LM handling fast prefill — is now available to any developer who wants to build on it. The interesting question is which use cases the community finds that Google hasn't thought of yet.

MCP's reach just extended to every Android phone. That's a different surface area than any enterprise deployment.

Follow for more coverage on MCP, agentic AI, and AI infrastructure.

GPT-5.5 vs Claude Opus 4.7 vs Gemini 3.1 Pro: The Frontier Model Showdown

Om Shree — Sat, 25 Apr 2026 03:38:59 +0000

Three flagship models. Three different labs. Three different bets on what production AI actually needs in 2026. GPT-5.5 dropped April 23, Opus 4.7 dropped April 16, and Gemini 3.1 Pro has been in developer preview since February 19. If you're building agents, coding tools, or any serious production workflow right now, you need to know exactly where each one wins — and where it doesn't.

This is the breakdown with no hedging.

The Problem With "Best Model" Claims

Every lab calls its flagship the best. The honest answer is that no single model wins across every workload in April 2026. The differentiation has shifted from raw intelligence to specificity: which model is best for your tasks, at your price point, on your infrastructure. The gap between these three models on most benchmarks is narrow enough that the wrong choice costs more in API spend and rework than the right choice saves in capability.

Here's how to actually read the comparison.

The Benchmark Map: Who Wins What

Agentic coding is the highest-stakes category right now, and the results are split.

On Terminal-Bench 2.0, GPT-5.5 achieves 82.7%, up from GPT-5.4's 75.1%. Claude Opus 4.7 sits at 69.4%. Gemini 3.1 Pro scores 54.2% on SWE-Bench Pro. GPT-5.5 wins Terminal-Bench decisively — this benchmark tests real command-line workflows, shell scripting, container orchestration, and tool chaining. If your agent lives in a terminal, this is the number that matters most.

But on SWE-Bench Pro — real GitHub issue resolution across Python, JavaScript, Java, and Go — the rankings flip. Opus 4.7 scores 64.3% on SWE-Bench Pro, leapfrogging both GPT-5.4 at 57.7% and Gemini at 54.2%. GPT-5.5's score of 58.6% puts it ahead of GPT-5.4 but still behind Opus 4.7 on this specific benchmark.

Tool use and MCP is Opus 4.7's clearest win. Opus 4.7 leads MCP-Atlas at 77.3%, ahead of GPT-5.4 at 68.1% and Gemini 3.1 Pro at 73.9%. MCP-Atlas measures complex, multi-turn tool-calling scenarios — the closest thing to a real production agent benchmark. For teams building orchestration agents that route across multiple tools in a single workflow, this result is the one to pay attention to.

Scientific reasoning (GPQA Diamond) is essentially a three-way tie. Opus 4.7 comes in at 94.2%, Gemini 3.1 Pro at 94.3%, and GPT-5.4 Pro at 94.4%. GPT-5.5 does not break this tie meaningfully. This benchmark is approaching saturation at the frontier — the differentiation is elsewhere.

Abstract reasoning (ARC-AGI-2) is Google's headline story. Gemini 3.1 Pro scored 77.1% on ARC-AGI-2, more than double Gemini 3 Pro's score of 31.1%. ARC-AGI-2 specifically tests novel pattern recognition that models cannot have memorized during training. Neither OpenAI nor Anthropic has published comparable scores here, which tells its own story.

Computer use is close but GPT-5.5 nudges ahead. GPT-5.5 achieves 78.7% on OSWorld-Verified, Opus 4.7 reaches 78.0%, both up from GPT-5.4's 75.0%. A 0.7-point gap in Opus 4.7's favor on the previous generation is now reversed — marginally.

Web search and browsing is GPT-5.5's other clear advantage. GPT-5.4 held a BrowseComp lead at 89.3% versus Opus 4.7's 79.3%. GPT-5.5 maintains this gap. If your agent needs to navigate the web reliably, OpenAI has the edge.

How Each Model Actually Works Differently

GPT-5.5 is a genuinely new foundation. It's the first fully retrained base model since GPT-4.5 — not a refinement of the GPT-5 architecture, but a model trained from scratch. That explains the Terminal-Bench jump. The model reasons about code execution differently at a fundamental level, not just incrementally better. It matches GPT-5.4's per-token latency while performing at a higher intelligence level — and uses fewer tokens to complete the same Codex tasks.

Claude Opus 4.7 introduced a behavioral shift that the benchmarks only partially capture. It devises ways to verify its own outputs before reporting back, catches its own logical faults during the planning phase, and accelerates execution far beyond previous Claude models. This isn't just a score improvement — it's a change in how the model approaches long-horizon agentic work. Low-effort Opus 4.7 is roughly equivalent to medium-effort Opus 4.6, which means the efficiency gain shows up in your token bill before you even tune effort levels. The vision upgrade also deserves mention: image resolution jumped from 1.15 megapixels to 3.75 megapixels — more than three times the pixel count of any prior Claude model.

Gemini 3.1 Pro plays a different game: multimodal breadth and context scale. It is the only frontier model with true native multimodal support — handling text, images, audio, and video simultaneously within a single unified model. GPT-5.5 handles text and images but not audio or video at the API level. Opus 4.7 has excellent vision but no audio or video. The context window is 2 million tokens — the largest of any frontier model available today. In practical terms, this means processing entire book collections, extensive legal contracts, or hours of video in a single prompt. GPT-5.5 and Opus 4.7 both offer 1M context windows, but Gemini doubles it.

What Developers Are Actually Using Each One For

GPT-5.5 in Codex is the default choice for infrastructure automation, CI/CD scripting, and multi-step computer use. The Terminal-Bench lead is real and it matters for DevOps-adjacent workflows. Cursor co-founder Michael Truell confirmed GPT-5.5 stayed on task longer and showed more reliable tool use than GPT-5.4. It's also the model to choose if your agent does significant web navigation.

Claude Opus 4.7 is the strongest choice for production coding agents that need to reason through ambiguous, multi-file engineering problems — and for any workflow that requires reliable tool orchestration. Vercel confirmed Opus 4.7 does proofs on systems code before starting work — a new behavior not seen in prior Claude models. For legal tech, financial analysis, and document-heavy enterprise work, the Finance Agent benchmark win (64.4%, state-of-the-art at release) and the BigLaw Bench result (90.9%) are concrete signals.

Gemini 3.1 Pro is the right choice when your workload is research-heavy, multimodal by nature, or involves very long context that would push the other models to their limits. It's also the only model in this group that can natively process video alongside text — useful for content pipelines, educational tooling, and media analysis.

The Pricing Table That Actually Matters

This is where the decision often gets made.

Gemini 3.1 Pro costs $2.00 per million input tokens and $12.00 per million output tokens.

Claude Opus 4.7 is priced at $5 per million input tokens and $25 per million output tokens — unchanged from Opus 4.6.

GPT-5.5 costs $5.00 per million input tokens and $30.00 per million output tokens.

At equivalent input pricing, Gemini 3.1 Pro costs 60% less than the other two flagships. At 10 million output tokens per month, Gemini comes in at roughly $120, Opus 4.7 at $250, and GPT-5.5 at $300. For high-volume workloads where Gemini's benchmark profile is sufficient, that gap is real budget.

One important caveat on Opus 4.7: the new tokenizer can use roughly 1.0–1.35x more tokens than Opus 4.6 depending on content. Replay real prompts before assuming the list price is your actual cost.

On GPT-5.5: cached input tokens drop to $0.50 per million — a tenth of the standard rate. Cache your system prompts and tool schemas on any multi-turn workflow.

Why This Three-Way Split Is a Bigger Deal Than It Looks

The 2024 playbook was: pick the smartest model, use it for everything. That playbook is dead.

The April 2026 frontier is differentiated enough that routing by task type is now the correct architecture. GPT-5.5 on terminal and browser tasks, Opus 4.7 on complex multi-file coding and tool orchestration, Gemini 3.1 Pro on research, video, and long-context analysis — that's not hedging, it's the optimal engineering decision given where benchmarks actually sit.

An IDC analyst framed the structural dynamic plainly: no single model wins everywhere, which is healthy for the ecosystem and gives developers real choices based on specific needs. The developers who treat model selection as a routing problem — rather than a loyalty problem — will ship better products at lower cost.

Access and Availability

GPT-5.5 is live in ChatGPT for Plus, Pro, Business, and Enterprise users. API access (gpt-5.5) is available now through OpenAI's platform at $5/$30 per million tokens.

Claude Opus 4.7 (claude-opus-4-7) is generally available via the Anthropic API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry at $5/$25 per million tokens.

Gemini 3.1 Pro is available in developer preview via Google AI Studio, Vertex AI, and Gemini CLI at $2/$12 per million tokens (under 200K context).

There is no universal winner in April 2026. There are three strong models with distinct profiles, real price differences, and specific workloads where each one is the right default. The engineers who benchmark their actual tasks against all three will build better systems than the ones who follow lab marketing. Start there.

Follow for more coverage on MCP, agentic AI, and AI infrastructure.

Google Just Killed Vertex AI. Here's What the Gemini Enterprise Agent Platform

Om Shree — Sat, 25 Apr 2026 03:31:36 +0000

Vertex AI has been Google Cloud's AI development platform since 2021. On April 22, 2026, at Google Cloud Next in Las Vegas, Google retired it — not with a deprecation notice, but with a full rebrand and architectural overhaul. Going forward, all Vertex AI services and roadmap evolutions will be delivered exclusively through Agent Platform. If you're building on Google Cloud's AI stack, the ground just shifted.

The Problem It's Solving

Vertex AI was built for a different era. In the early days of generative AI, building safe and reliable business tools took massive engineering effort and a high tolerance for trial and error. Vertex handled that well — model selection, fine-tuning, deployment. But it was never designed for what enterprise AI has actually become: fleets of autonomous agents operating across dozens of systems simultaneously, often without proper security or governance guardrails.

The gap is real. You can build a capable agent today without much trouble. Governing it — knowing what it's doing, what it has access to, whether it's behaving as intended — is a different problem entirely. Anthropic has Managed Agents, which cover runtime and memory but leave governance and observability to third parties. Google is betting that owning that full stack is the differentiator.

How the Gemini Enterprise Agent Platform Actually Works

The platform is organized around four pillars: Build, Scale, Govern, and Optimize. Each maps to a concrete set of tools, not just marketing categories.

Build covers the development surface. Agent Studio provides a low-code visual canvas for designing, prototyping, and managing agent reasoning loops. The Agent Development Kit (ADK) handles code-first development of complex agents. Agent Garden gives developers a library of prebuilt agents and templates. And Model Garden provides access to over 200 foundation models — including Gemini 3.1 Pro, Gemma 4, and third-party models like Anthropic's Claude Opus, Sonnet, and Haiku.

A significant ADK upgrade ships with this release. More than six trillion tokens are processed monthly through ADK. The new graph-based framework lets you organize agents into a network of sub-agents, defining clear, reliable logic for how they collaborate on complex problems.

Scale is handled by Agent Runtime, which is rebuilt for a specific and important use case: long-running agents that maintain state for days at a time, backed by Memory Bank for persistent, long-term context. This is where Google draws a real line against stateless chat-based architectures. Payhawk is already using Memory Bank so their Financial Controller Agent recalls user habits and auto-submits expenses, cutting submission time by over 50%.

Govern is where this platform separates from everything else on the market. Three components do the work:

Agent Identity gives every agent a unique cryptographic ID, creating a clear auditable trail for every action it takes, mapped back to defined authorization policies. Think of it as IAM, but for agents rather than humans — SPIFFE-formatted, natively integrated.

Agent Registry provides a single source of truth for the enterprise: it indexes every internal agent, tool, and skill, ensuring only governed and approved assets are available to your users.

Agent Gateway acts as the air traffic control for your agent ecosystem — providing secure, unified connectivity between agents and tools across any environment, while enforcing consistent security policies and Model Armor protections to safeguard against prompt injection and data leakage.

Optimize closes the loop with Agent Simulation, Agent Evaluation, and Agent Observability. Multi-Turn AutoRaters and Online Evaluation for live traffic give systematic quality assessment. The Unified Trace Viewer provides detailed visibility into agent reasoning and performance for debugging.

What Teams Are Actually Using It For

The customer quotes in the announcement are more concrete than typical launch testimonials, which makes them worth citing.

Comcast rebuilt the Xfinity Assistant using ADK — moving from scripted automation to conversational, generative troubleshooting. Color Health built a Virtual Cancer Clinic that uses Agent Runtime to check screening eligibility, connect patients to clinicians, and schedule appointments at scale. L'Oréal is arguably the most technically interesting case: their Beauty Tech Agentic Platform uses ADK for agent orchestration, and connects agents to their data sources via Model Context Protocol (MCP), securely linked to their core operational applications.

PayPal is also live with Agent Payment Protocol (AP2), using it as the foundation for trusted agent-initiated payments. That's not a demo — that's commerce infrastructure.

More than 85% of OpenAI's workforce uses Codex every week was one of GPT-5.5's big enterprise claims. Google's equivalent signal here is six trillion tokens per month through ADK alone. The scale is real.

Why This Is a Bigger Deal Than It Looks

The headline is governance. Every serious enterprise blocker for production agentic AI comes back to the same questions: Who authorized this agent to do that? What did it actually do? Can we audit it? Can we revoke it? Until this week, the honest answer in almost every platform was "partially, with custom tooling."

An IDC analyst framed Google's actual differentiation clearly: "Google has entrenched hardware, developer tools to build and manage agents, and an end-user AI app in Gemini — no one else has those three. That full lifecycle is what they're really hoping differentiates them."

The MCP integration is also worth flagging for this audience specifically. Agent Gateway and Agent Registry natively support MCP servers — meaning any tool you've already built using the Model Context Protocol can be registered, governed, and exposed to agents through the same identity and policy system. That's a significant win for developers who've already built on MCP.

Developers currently building on Vertex AI keep working in the same console, but the product has a different name and incorporates components that did not exist before: runtimes for long-running agents, persistent memories, registries with cryptographic IDs, security gateways, and simulation tools. The migration surface is low. The capability delta is not.

Availability and Access

Announced at Google Cloud Next on April 22, 2026, the platform brings together the Gemini Enterprise app, the Gemini Enterprise Agent Platform, and a partner marketplace that lets companies deploy third-party agents from vendors including Oracle, Salesforce, ServiceNow, Adobe, and Workday inside the same governed environment.

You can access the platform directly at Agent Platform in the Google Cloud console. The ADK is available at docs.cloud.google.com. Full documentation for the governance layer — Agent Identity, Gateway, and Registry — is at the Agent Platform overview.

Google says the new Gemini Enterprise features will roll out over the coming months. Not everything is GA today — build your evaluation timeline accordingly.

The enterprise agentic AI race has moved past "which model is smartest" into "which platform can actually govern thousands of agents at once." Google just made the most complete argument yet that it has an answer. Whether the execution matches the architecture is what the next six months will show.

Follow for more coverage on MCP, agentic AI, and AI infrastructure.