Running MCP Servers on EKS: Architecture for Production Agentic Workloads
MCP servers are spreading from local dev tools into production infrastructure. The default deployment patterns work fine for a single user. They fall apart when you put MCP behind real traffic on EKS. This post is about what changes when you run MCP servers as production workloads, what breaks, and how to design around it.
For most of 2025, MCP servers lived on developer laptops. You connected your editor or your local Claude client to a stdio process, the process held a session, you were the only user. Production was not a concern because there was no production.
That changed quickly. By the back half of 2025, internal teams started exposing MCP servers as shared infrastructure. Wrap an internal API, an EKS cluster, a Postgres database, or a SaaS account in MCP, deploy it, point an agent or a fleet of agents at it. The pattern is powerful and the operational model is genuinely new. It is not “another stateless HTTP API”. It is not “a long-running worker”. It is somewhere in between, and that in-between is where most teams currently get burned.
This post is about what production MCP looks like on EKS once you stop pretending it is a normal workload.
Why this is a real problem
Three things make MCP awkward to deploy on Kubernetes the way you would deploy a normal API.
First, sessions matter. Most useful MCP servers maintain per-client state across multiple tool calls: a database connection, a workspace context, a chain of resources the agent has already read. Stateless load balancing across replicas breaks that.
Second, traffic patterns are bursty and long-lived. An agent might issue 30 tool calls in 8 seconds, then sit idle on the same connection for 5 minutes, then issue another burst. ALB idle timeouts, HPA sampling windows, and Karpenter consolidation policies are all calibrated for traffic that does not behave like this.
Third, the blast radius is unusual. An MCP server is, by design, a controlled way to give an LLM agent tool execution rights. The IAM role attached to that pod can read your S3, your Secrets Manager, your RDS. If the server is misconfigured, an agent can ask it to do things you did not intend, with permissions you did not audit.
These three things, together, mean you cannot just kubectl apply -f mcp-server.yaml and call it done. You can. It will look like it works. It will eventually surprise you.
Transport types and what they imply
MCP supports three transports today: stdio, HTTP+SSE (legacy), and Streamable HTTP. The choice affects almost every other design decision.
```mermaid
flowchart LR
    subgraph LOCAL["stdio (in-process)"]
        direction TB
        P1[Client process] -->|stdin/stdout| S1["MCP server<br/>subprocess"]
    end
    subgraph LEGACY["HTTP+SSE (legacy)"]
        direction TB
        C2[Agent] -->|POST /messages| S2[MCP server]
        S2 -->|SSE stream| C2
    end
    subgraph PROD["Streamable HTTP (production)"]
        direction TB
        C3[Agent] -->|"POST /mcp<br/>+ session ID"| S3[MCP server]
        S3 -->|streamed response| C3
    end
```
stdio is the simplest. The MCP server is a process that reads from stdin and writes to stdout. It assumes one client per process. On Kubernetes, you usually run stdio MCP servers as a sidecar to a long-running client process, not as a network service. If you find yourself trying to expose stdio over a network, stop. There is a transport for that and it is not stdio over a TCP shim.
HTTP+SSE was the original networked transport. It is still supported but no longer recommended for new servers. Skip it if you can.
Streamable HTTP is what you want for production. The server exposes a single HTTP endpoint that handles requests and uses server-sent events to stream responses. Sessions are explicit, identified by a session ID header. This is the transport you can put behind an ALB without wanting to throw your laptop into the canal.
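To make the shape concrete, here is a minimal Streamable HTTP server sketch using the FastMCP helper from the official Python SDK (recent SDK versions; the tool and its body are placeholders):

```python
# Minimal Streamable HTTP MCP server sketch. The tool below is a
# placeholder; a real server would wrap your actual integration.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("example-server")

@mcp.tool()
def list_buckets() -> list[str]:
    """Placeholder read-only tool; a real server would call AWS here."""
    return ["bucket-a", "bucket-b"]

if __name__ == "__main__":
    # Serves a single /mcp endpoint. Sessions are tracked via the
    # Mcp-Session-Id header the transport assigns on initialize.
    mcp.run(transport="streamable-http")
```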
For the rest of this post, “MCP server” means a Streamable HTTP server unless I say otherwise.
The deployment patterns that actually work
There are three architectural patterns I have seen succeed in production EKS environments. Each one fits a different use case. Picking the right one matters because the cost of getting it wrong is not “this is slower”, it is “this does not scale at all”.
```mermaid
flowchart TB
    subgraph P1["Pattern 1: Per-tenant Deployment"]
        direction LR
        T1A[Tenant A agent] --> S1A["MCP-A<br/>Deployment"]
        T1B[Tenant B agent] --> S1B["MCP-B<br/>Deployment"]
        S1A --> R1A[role-tenant-A]
        S1B --> R1B[role-tenant-B]
    end
    subgraph P2["Pattern 2: Shared with auth-scoped tools"]
        direction LR
        T2A[Tenant A agent] --> SHARED["MCP shared<br/>Deployment"]
        T2B[Tenant B agent] --> SHARED
        SHARED -->|AssumeRole tenant A| RA2[role-A]
        SHARED -->|AssumeRole tenant B| RB2[role-B]
    end
    subgraph P3["Pattern 3: Per-session ephemeral"]
        direction LR
        T3[Agent connects] --> CTRL[Operator]
        CTRL -->|provisions| POD3["MCP Pod<br/>session-xyz"]
        POD3 -.->|reaped on disconnect| GONE[gone]
    end
```
Pattern 1: Per-tenant deployment
One MCP server Deployment per tenant, with its own namespace, its own service account, its own IAM role. The agent for tenant A talks only to tenant A’s MCP server. Resource isolation is at the Kubernetes namespace and IAM role level.
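In Kubernetes terms the pattern is unexciting, which is the point. A sketch with illustrative names, one tenant shown (the IAM binding happens out of band via a Pod Identity association, covered below):

```yaml
# Per-tenant isolation sketch. Names, image, and replica count are illustrative.
apiVersion: v1
kind: Namespace
metadata:
  name: mcp-tenant-a
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mcp-server
  namespace: mcp-tenant-a   # Pod Identity maps this SA to role-tenant-A
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server
  namespace: mcp-tenant-a
spec:
  replicas: 2
  selector:
    matchLabels: { app: mcp-server }
  template:
    metadata:
      labels: { app: mcp-server }
    spec:
      serviceAccountName: mcp-server
      containers:
        - name: server
          image: registry.example.com/mcp-server:1.4.2  # illustrative
          ports:
            - containerPort: 8080
```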
This is the simplest pattern to reason about and the easiest to get right on day one. The downsides show up later: you have N copies of the MCP server image consuming resources even when idle, and your platform team is now in the business of provisioning a small Kubernetes deployment every time someone signs up for the product.
When this fits: small number of tenants (<50), each one with meaningful traffic, each one with distinct IAM scopes. Internal tooling at most companies fits this profile.
When it does not: SaaS-shaped products with hundreds or thousands of tenants. You will spend more on idle pods than on the product.
Pattern 2: Shared server with auth-scoped tools
One MCP server Deployment. Multiple tenants connect to it. Authentication happens at the connection level (OAuth, signed tokens), and the MCP server uses the tenant identity to scope every tool call.
In this pattern, the IAM role on the pod is the union of permissions across all tenants, and the application code is responsible for filtering. This is fragile by default and excellent when done well.
The “done well” part requires:
- Per-tenant IAM role assumption from inside the server (the pod role assumes a tenant-scoped role, then makes the AWS call with that).
- Strict input validation on every tool argument.
- An audit log that records, for every tool call, which tenant invoked it and which underlying AWS API was called with which arguments.
When this fits: high tenant count, low average traffic per tenant, well-understood auth model, mature security review process.
When it does not: when the team’s first instinct is “we will use the same role for everyone”. You are now one prompt injection away from a cross-tenant data leak.
Pattern 3: Per-session ephemeral server
One MCP server Pod per active session. When the agent connects, an Operator (or a custom controller) provisions a Pod, the agent talks to it, and when the session ends the Pod is reaped.
This is the cleanest model for isolation: every session has its own process boundary, its own filesystem, its own credentials lifetime. It is also the most expensive in cluster overhead.
I have only seen this work when sessions are long (hours) and tool calls are heavy (running notebooks, executing user-supplied code, etc.). If your average session is 90 seconds, the Pod startup cost dominates and you should pick pattern 1 or 2.
When this fits: code execution sandboxes, data science notebooks, anything where the session has a meaningful filesystem state.
Authentication and the IAM role question
This is where most production MCP deployments quietly fail their security review.
The default pattern in MCP examples is “the server has access to S3”, with no further specification. In production, you need a precise answer to one question: which IAM principal does the underlying AWS API call run as?
On EKS, you have two options worth considering.
EKS Pod Identity (the newer one) attaches a role to a Kubernetes service account through the EKS API directly. No OIDC trust policy to configure per cluster, no eks.amazonaws.com/role-arn annotation. The pod gets credentials via the Pod Identity Agent DaemonSet. This is the pattern I would pick today for any new workload, MCP or otherwise.
IRSA (the older one) does the same thing through OIDC federation. It still works, it is still supported, and migrating an existing MCP server from IRSA to Pod Identity is rarely worth the effort if the IRSA setup is healthy.
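For reference, creating a Pod Identity association is a single API call. A sketch with illustrative names (cluster, namespace, account ID, role):

```bash
# Map the mcp-server service account to an IAM role via Pod Identity.
# All names and the account ID are illustrative.
aws eks create-pod-identity-association \
  --cluster-name prod-cluster \
  --namespace mcp-tenant-a \
  --service-account mcp-server \
  --role-arn arn:aws:iam::123456789012:role/role-tenant-A
```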
Either way, the role attached to the pod must grant the least privilege necessary for the worst-case caller. If the MCP server has 12 tools and 11 of them are read-only and 1 of them deletes things, the role still includes Delete, and any caller can invoke the Delete tool. There is no “this tool is read-only” enforcement at the IAM layer. The application code is the only thing between the agent and the API.
For multi-tenant patterns (Pattern 2), use STS AssumeRole inside the MCP server to drop into a tenant-scoped role for each call. The pod role gains sts:AssumeRole against a list of tenant roles, and nothing else. This is the same pattern AWS uses internally for cross-account services. It is well-understood, it audits cleanly, and it is annoying to set up the first time.
flowchart LR POD["MCP server pod
(Pod Identity)"] --> ROLE_POD["Pod role
sts:AssumeRole only"] ROLE_POD -->|tenant A request| ASSUME_A[AssumeRole
role-tenant-A] ROLE_POD -->|tenant B request| ASSUME_B[AssumeRole
role-tenant-B] ASSUME_A --> RES_A[(Tenant A
S3 / RDS / ...)] ASSUME_B --> RES_B[(Tenant B
S3 / RDS / ...)] ROLE_POD -.->|"all calls logged with
session_id, tenant_id, tool"| AUDIT[(Audit log)]
The trust policy on role-tenant-A only trusts role-pod, which only trusts the EKS Pod Identity association. Three links, each one explicit. Auditing reads cleanly because every log line says which session triggered which AssumeRole.
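Inside the server, the per-request hop is one STS call. A sketch with boto3; the role ARNs and the tenant lookup are illustrative:

```python
# Per-tenant credential hop: the pod role may only sts:AssumeRole into
# tenant-scoped roles, and every AWS call runs as the tenant role.
import boto3

sts = boto3.client("sts")

# Illustrative mapping; in practice this comes from your tenant registry.
TENANT_ROLE_ARNS = {
    "tenant-a": "arn:aws:iam::123456789012:role/role-tenant-A",
    "tenant-b": "arn:aws:iam::123456789012:role/role-tenant-B",
}

def client_for_tenant(tenant_id: str, service: str, session_id: str):
    creds = sts.assume_role(
        RoleArn=TENANT_ROLE_ARNS[tenant_id],
        # Session name shows up in CloudTrail; STS caps it at 64 chars.
        RoleSessionName=f"mcp-{tenant_id}-{session_id}"[:64],
        DurationSeconds=900,  # shortest STS allows
    )["Credentials"]
    return boto3.client(
        service,
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

# Usage inside a tool handler:
# s3 = client_for_tenant("tenant-a", "s3", session_id)
# s3.list_objects_v2(Bucket="tenant-a-data")
```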
The anti-pattern: attaching AdministratorAccess “for now”. There is no later. There is only “for now” forever, and your security team will find it during the next audit.
Why Karpenter consolidation hates MCP servers
Consolidation is one of the best features Karpenter has. It packs pods onto fewer nodes when the cluster is underutilized. For most workloads, this is great. For MCP servers, it is a problem you have to design around.
A consolidation event evicts pods. Evicting an MCP server pod kills every session that pod was holding. Long-running agents that were 14 tool calls into a 20-call task drop their context and have to start over.
```mermaid
sequenceDiagram
    participant A as Agent
    participant M as MCP pod
    participant K as Karpenter
    participant N as Node
    A->>M: open session, tool call 1
    M-->>A: response
    A->>M: tool calls 2..14
    M-->>A: responses
    Note over K,N: Underutilized: consolidation triggers
    K->>N: cordon, drain
    N->>M: SIGTERM
    M-->>A: connection closed mid-task
    Note over A: 14 calls of context lost
    A->>A: must restart from scratch
```
You have three options.
Option A: opt out of consolidation. Add karpenter.sh/do-not-disrupt: "true" to the pod or use a NodePool with consolidation disabled. Cost goes up, sessions survive. Use this for the “per-session ephemeral server” pattern (Pattern 3) where each pod’s lifetime is meaningful.
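Note the annotation belongs on the pod, via the Deployment’s pod template, not on the Deployment’s own metadata. A minimal sketch:

```yaml
# Option A: opt MCP pods out of Karpenter voluntary disruption.
spec:
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: "true"
```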
Option B: design for graceful resumption. Keep the MCP server itself stateless, with session state stored externally (Redis, Postgres, S3). When a pod is evicted, the agent reconnects to a different pod and resumes from the persisted state. This is the right answer for Patterns 1 and 2, and it is real engineering work. Most teams underestimate it.
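What “stored externally” means in practice is small but unglamorous: key the session record on the transport’s Mcp-Session-Id and read it back on reconnect. A sketch with redis-py; the key layout, host, and TTL are illustrative:

```python
# Option B sketch: session state keyed on the MCP session ID, so any
# replica can resume a session after the original pod is evicted.
import json
import redis

r = redis.Redis(host="redis.mcp.svc.cluster.local", port=6379)

SESSION_TTL_SECONDS = 3600  # illustrative; match your longest session

def save_session(session_id: str, state: dict) -> None:
    r.set(f"mcp:session:{session_id}", json.dumps(state), ex=SESSION_TTL_SECONDS)

def load_session(session_id: str) -> dict:
    raw = r.get(f"mcp:session:{session_id}")
    return json.loads(raw) if raw else {}
```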
Option C: PDBs and disruption budgets that match session length. Set a PDB with maxUnavailable: 0 on the MCP pods, which blocks voluntary evictions outright, and use Karpenter’s disruption.budgets to limit how much of the fleet can be consolidated at a time. This is the middle path, and it works if your sessions are short enough that “no disruption for 5 minutes” is acceptable.
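A sketch of Option C with illustrative numbers, using a policy/v1 PDB and Karpenter’s v1 NodePool API; the budget caps how many nodes Karpenter may disrupt at once:

```yaml
# Block voluntary evictions of MCP pods entirely.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mcp-server
spec:
  maxUnavailable: 0
  selector:
    matchLabels: { app: mcp-server }
---
# Cap consolidation to 10% of this NodePool's nodes at a time.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: mcp
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default          # illustrative
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    budgets:
      - nodes: "10%"
```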
The wrong answer is to do nothing and discover, two months in, that your agent’s context evaporates twice a day around the time Karpenter decides the cluster is underutilized.
Networking: the ALB and the idle timeout
Two ALB defaults bite MCP servers in production.
The connection idle timeout defaults to 60 seconds. An MCP session that goes idle for 61 seconds gets the connection closed. Some clients reconnect cleanly, others do not. Either way, you are introducing a failure mode at the load balancer that the application could have handled differently.
The fix: bump the idle timeout to 300 or 600 seconds on the ALB that fronts the MCP servers. With the AWS Load Balancer Controller, that is a load balancer attribute set through an Ingress annotation:

```yaml
alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=600
```
The other ALB setting to know about is cross-zone load balancing. ALBs enable it by default, but it can be switched off at the target group level; leave it on. Without it, traffic to an MCP server that has pods in 2 of 3 AZs gets distributed unevenly, which combined with sticky sessions produces a frustrating support ticket pattern of “tenant A is fast, tenant B is slow”.
If your MCP server uses sticky sessions (it probably should not, see Pattern 2 above, but if it does), use the ALB target group’s stickiness setting, not application-cookie stickiness. Application cookies require the agent to handle them, and most don’t.
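With the AWS Load Balancer Controller, load-balancer-generated cookie stickiness is a target group attribute on the Ingress; the duration here is illustrative:

```yaml
alb.ingress.kubernetes.io/target-group-attributes: stickiness.enabled=true,stickiness.type=lb_cookie,stickiness.lb_cookie.duration_seconds=600
```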
Observability: tracing tool calls
A production MCP server deserves the same observability as any other production API. The thing that changes is the unit of work.
For a normal HTTP API, you trace the request: one span, one user-visible operation. For an MCP server, the unit is the tool call: one span per tool invocation, with attributes for the tool name, the tenant, the session ID, and the size of the input and output.
OpenTelemetry covers this fine. The implementation detail that matters is propagating the trace context across the MCP transport. Streamable HTTP is normal HTTP, so standard W3C traceparent headers work, but the MCP SDK does not propagate them by default. You have to wire it in.
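The wiring itself is small. A sketch with the OpenTelemetry Python API, assuming a hook on the tool dispatch path that can see the incoming HTTP headers (the function and attribute names are illustrative):

```python
# Extract W3C trace context from the incoming request and open one
# span per tool call, tagged with the attributes that matter.
from opentelemetry import trace
from opentelemetry.propagate import extract

tracer = trace.get_tracer("mcp-server")

def traced_tool_call(headers: dict, tool_name: str, tenant_id: str,
                     session_id: str, args: dict, handler):
    ctx = extract(headers)  # picks up the traceparent header if present
    with tracer.start_as_current_span(
        f"mcp.tool/{tool_name}", context=ctx
    ) as span:
        span.set_attribute("mcp.tool", tool_name)
        span.set_attribute("mcp.tenant_id", tenant_id)
        span.set_attribute("mcp.session_id", session_id)
        span.set_attribute("mcp.input_size", len(str(args)))
        result = handler(**args)
        span.set_attribute("mcp.output_size", len(str(result)))
        return result
```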
Once it is wired in, your observability stack can correlate:
- Which tenant’s agent triggered which AWS API calls.
- Which tool calls are the slow ones.
- Which tool calls failed and why.
- The shape of an agent’s interaction over time (5 reads, 1 write, 2 reads, etc.).
That last one is invaluable for capacity planning. You will discover that 90% of your traffic is one tool, and you should optimize that one.
What I would not do
A few patterns I have seen attempted that I would steer away from.
Running stdio MCP servers behind a network proxy. Someone always tries this. It almost works. It breaks in subtle ways at session boundaries. Use Streamable HTTP.
One Service per MCP server, each with its own ALB. ALBs are not free, and you do not need one per server. Use Ingress with path-based or host-based routing, and let one ALB front many MCP services. Save the dedicated ALBs for workloads that justify them.
Storing session state in the pod’s local filesystem. Pods get evicted, scheduled, rescheduled. State on disk is state you have lost. Use Redis, Postgres, or S3.
Skipping the audit log. Every tool call should produce a structured log entry: tenant, session, tool, input hash, output size, latency, outcome. When something goes wrong six months from now, you will need this. Build it on day one.
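The entry itself is one structured line per call. A sketch; the field names are illustrative:

```python
# One structured audit line per tool call. Ship these to a separate,
# long-retention store; hash inputs rather than logging them raw.
import hashlib
import json
import sys
import time

def audit(tenant_id: str, session_id: str, tool: str,
          args: dict, output_size: int, latency_ms: float, outcome: str):
    entry = {
        "ts": time.time(),
        "tenant_id": tenant_id,
        "session_id": session_id,
        "tool": tool,
        "input_sha256": hashlib.sha256(
            json.dumps(args, sort_keys=True).encode()
        ).hexdigest(),
        "output_size": output_size,
        "latency_ms": latency_ms,
        "outcome": outcome,  # "ok" | "error" | "denied"
    }
    print(json.dumps(entry), file=sys.stdout, flush=True)
```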
Letting the MCP server’s IAM role grow over time. Every new tool that needs an AWS permission gets added to the role and never removed. After 18 months, the role is a graveyard. Audit the role quarterly, and remove permissions when tools are deprecated.
What I would build today
If I were standing up a production MCP server on EKS in 2026, the design I would aim for looks like this.
Streamable HTTP transport. One Deployment behind an Ingress, fronted by an ALB with its idle timeout raised to 600 seconds. Pod Identity for the AWS role, scoped tightly. STS AssumeRole inside the server for tenant-scoped operations.
Session state in Redis or DynamoDB, never in the pod. PDB with maxUnavailable: 1, Karpenter NodePool with consolidation budgets capped at 10% per hour.
OpenTelemetry instrumentation with one span per tool call, propagated traces from the agent through the MCP server to the underlying AWS APIs. Audit log on every tool invocation, structured, shipped to a long-retention store separate from the application logs.
Per-tenant IAM roles for any tool that touches tenant data, even if you start with one tenant. The role architecture is much harder to retrofit than to build correctly.
That is the platform. The MCP server itself is a few hundred lines of code on top of an SDK. The interesting engineering is in the operational envelope you wrap it in.
Takeaway
MCP servers in production are still new enough that “the way everyone does it” has not converged. The patterns that work look more like “stateful API behind a careful auth boundary” than like “stateless service”. Treat them like that and you will avoid most of the foot-guns.
Treat them like a normal HTTP service and you will discover, one bad afternoon, that your agent’s session disappeared because Karpenter consolidated a node, and the audit log you skipped will not tell you why.
