
My Journey Scaling Agent Infrastructure from PoC to Production

📖 9 min read · 1,773 words · Updated Apr 15, 2026

Hey everyone, Maya here, back on agntup.com! Today, I want to talk about something that’s probably keeping a lot of you up at night, especially those of us playing with agent deployments: scaling. And not just scaling in theory, but scaling when things get real – when your proof-of-concept explodes into an actual, critical service. Specifically, I want to dive into the often-overlooked, sometimes terrifying, but utterly necessary journey of scaling your agent infrastructure from a handful of VMs to a truly dynamic, production-ready environment using Kubernetes. Because let’s be honest, those manual SSH sessions get old, fast.

My own journey into this particular rabbit hole started about a year ago. We had this fantastic internal agent, codenamed “Chameleon,” designed to monitor the health and performance of our various microservices across several environments. For a good six months, it was a darling. I had it running on maybe 20 VMs, mostly in a staging environment, and I’d just SSH in, `git pull`, `npm install`, `pm2 restart`. Easy peasy. I even had a little script for it. I felt like a wizard.

Then came the mandate: Chameleon needed to go live, *everywhere*. We were talking hundreds of nodes, across multiple cloud providers, with varying security requirements and network topologies. My wizard hat suddenly felt incredibly silly. My little SSH script wasn’t going to cut it. The thought of manually updating hundreds of agents, one by one, gave me actual hives. That’s when I knew: it was time to embrace the orchestrator. And for us, that meant Kubernetes.

From Pet VMs to Cattle Pods: The Mindset Shift

The biggest hurdle wasn’t technical; it was psychological. For so long, those VMs running Chameleon were “pets.” I knew each one by name (or at least by IP address), I knew its quirks, its history. When one misbehaved, I’d nurse it back to health. Kubernetes, however, forces you into a “cattle” mindset. Nodes are ephemeral. Pods are disposable. If something goes wrong, you don’t fix it; you replace it. This shift is crucial for scaling. You need to design your agents, and your deployment strategy, with this impermanence in mind.

For Chameleon, this meant re-evaluating everything. Was its state stored locally? (Spoiler: it was, initially). Could it pick up where it left off if it suddenly disappeared and reappeared on a new node? Could it handle configuration updates dynamically without a full restart? These were the questions that kept me up. And frankly, if your agents aren’t designed with this in mind, scaling them with Kubernetes will be a world of pain. So, first step: prepare your agents for the cattle ranch.
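
One concrete way to lean into that impermanence is to let Kubernetes decide when an agent should be replaced. Here's a minimal sketch of a liveness probe stanza for the agent container, assuming (hypothetically) that your agent exposes a /healthz endpoint on a port like 9100; both the path and the port are placeholders, not something Chameleon actually shipped with:

# Goes under the agent container in the pod spec.
# The /healthz endpoint and port 9100 are hypothetical placeholders.
livenessProbe:
  httpGet:
    path: /healthz
    port: 9100
  initialDelaySeconds: 10  # Give the agent time to start up
  periodSeconds: 30        # Probe every 30 seconds
  failureThreshold: 3      # Restart after 3 consecutive failures

If the probe fails repeatedly, the kubelet restarts the container for you: no SSH session, no nursing.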

Kubernetes Basics for Agent Deployment: DaemonSets to the Rescue

Alright, let’s get into the technical bits. When you’re deploying an agent that needs to run on *every* node in your cluster, a standard Deployment won’t quite cut it. A Deployment ensures a certain number of identical pods are running, but it doesn’t guarantee one per node. That’s where Kubernetes DaemonSets come in. A DaemonSet ensures that all (or some) nodes run a copy of a pod. As nodes are added to the cluster, new pods are automatically added to them. As nodes are removed from the cluster, those pods are garbage collected. It’s literally tailor-made for agents.

Let me show you a simplified example of how we structured our Chameleon DaemonSet. This isn’t the full, production-hardened version (that would be a very long YAML file!), but it captures the essence.

Example: Chameleon Agent DaemonSet


apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: chameleon-agent
  labels:
    app: chameleon-agent
spec:
  selector:
    matchLabels:
      app: chameleon-agent
  template:
    metadata:
      labels:
        app: chameleon-agent
    spec:
      # This is crucial: runs on all nodes (can be filtered with a nodeSelector)
      hostNetwork: true                   # Sometimes agents need direct network access
      dnsPolicy: ClusterFirstWithHostNet  # Usually required when hostNetwork is true
      containers:
        - name: chameleon-agent-container
          image: agntup/chameleon-agent:v1.3.0  # Your agent image
          imagePullPolicy: Always
          env:
            - name: CHAMELEON_API_KEY
              valueFrom:
                secretKeyRef:
                  name: chameleon-secrets
                  key: api-key
            - name: CHAMELEON_ENVIRONMENT
              value: "production"
          resources:
            limits:
              memory: "128Mi"
              cpu: "100m"
            requests:
              memory: "64Mi"
              cpu: "50m"
          volumeMounts:
            - name: agent-config
              mountPath: /etc/chameleon/config
            - name: var-log
              mountPath: /var/log/chameleon  # If your agent writes to host logs
      volumes:
        - name: agent-config
          configMap:
            name: chameleon-agent-config
        - name: var-log
          hostPath:
            path: /var/log/chameleon  # Persistent log storage on host
            type: DirectoryOrCreate
      tolerations:
        # If you want to run on control-plane (master) nodes too (be careful!)
        # For agents that monitor the cluster itself, this is often necessary.
        - key: "node-role.kubernetes.io/control-plane"
          operator: "Exists"
          effect: "NoSchedule"
        - key: "node-role.kubernetes.io/master"  # Legacy taint on clusters < v1.24
          operator: "Exists"
          effect: "NoSchedule"

A few things to note here:

  • hostNetwork: true and dnsPolicy: ClusterFirstWithHostNet: This is often necessary for agents that need to inspect network traffic or bind to specific ports on the host. Be aware of the security implications.
  • imagePullPolicy: Always: Re-pulls the image every time a pod starts, so a re-pushed tag takes effect immediately. Handy for rapid iteration, but pin immutable version tags in production so your rollbacks are deterministic.
  • env: Configuration via environment variables is a common pattern. Using secretKeyRef for sensitive data like API keys is a must.
  • resources: Don’t skip these! Agents, while often lightweight, still consume resources. Define limits and requests to prevent runaway agents from starving your nodes.
  • volumeMounts and volumes: We use a ConfigMap for static configuration and a hostPath for persistent logs (a sketch of that ConfigMap follows this list). If your agent needs to store state, think carefully about how to manage it. For Chameleon, we pushed all metrics to a central observability platform, so local state was minimal.
  • tolerations: Control-plane (master) nodes carry a NoSchedule taint, so DaemonSet pods won't land on them by default. If your agent needs to monitor them (e.g., a cluster-level health agent), you'll need to add the matching tolerations, as in the example above.
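
For completeness, here's a minimal sketch of what that chameleon-agent-config ConfigMap might look like; the keys and values are purely illustrative:

apiVersion: v1
kind: ConfigMap
metadata:
  name: chameleon-agent-config
data:
  # Illustrative settings; your agent's real config keys will differ
  config.yaml: |
    reportIntervalSeconds: 30
    logLevel: info

Each key under data shows up as a file beneath the mountPath (here, /etc/chameleon/config/config.yaml). Kubernetes eventually propagates ConfigMap changes into mounted volumes, which is one route to the dynamic configuration reloads I mentioned earlier, provided your agent watches its config file.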

Rolling Updates and Rollbacks: The Sanity Savers

My biggest fear, beyond manually updating hundreds of agents, was breaking hundreds of agents with a bad update. My old SSH script had a rudimentary `git revert` option, but it was manual and slow. Kubernetes handles this gracefully with rolling updates for DaemonSets. When you update the image tag in your DaemonSet definition, Kubernetes intelligently replaces pods one by one. If things go sideways, you can roll back to a previous version with a single command.
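
You can control how aggressive that rollout is with the DaemonSet's updateStrategy. RollingUpdate is the default; the maxUnavailable value below is just an example, not a recommendation:

spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1  # Replace at most one agent pod at a time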

For example, if I update my Chameleon agent image to `v1.3.1`:


kubectl set image ds/chameleon-agent chameleon-agent-container=agntup/chameleon-agent:v1.3.1

Kubernetes will start replacing pods. If I discover a critical bug, I can roll back:


kubectl rollout undo ds/chameleon-agent
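
Two related commands worth keeping handy: one to watch a rollout as it progresses, and one to inspect the revisions you can roll back to:

kubectl rollout status ds/chameleon-agent   # Blocks until the rollout completes (or fails)
kubectl rollout history ds/chameleon-agent  # Lists previous revisions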

This is where the “cattle” mindset truly shines. You’re not fixing individual agents; you’re defining the desired state, and Kubernetes makes it happen. The peace of mind this brings is immense, especially when you’re responsible for a critical system.

Advanced Scaling: NodeSelectors and Taints/Tolerations for Granularity

Sometimes, you don’t want your agent to run on *every* node. Maybe you have specialized GPU nodes that don’t need a general-purpose monitoring agent, or you have different environments within the same cluster (e.g., “secure-zone” nodes). This is where nodeSelectors and taints/tolerations come into play.

nodeSelectors allow you to specify that your DaemonSet pods should only run on nodes that have a specific label. For instance, if you only want Chameleon to run on nodes labeled `agent-type: full-monitor`:


spec:
  template:
    spec:
      nodeSelector:
        agent-type: full-monitor
      containers:
        # ... rest of your container definition

Then, you’d apply this label to your desired nodes:


kubectl label node <node-name> agent-type=full-monitor
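
A quick sanity check: list exactly the nodes the DaemonSet will now target:

kubectl get nodes -l agent-type=full-monitor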

Taints and tolerations approach the problem from the other direction. A taint on a node repels all pods except those that carry a matching toleration. This is often used to fence off dedicated nodes (e.g., keeping general workloads off control-plane nodes or dedicated CI/CD runners).

We used this for our specialized “edge” agents. These agents were heavier, required specific hardware access, and we only wanted them on nodes explicitly designated for edge processing. We’d taint those nodes, and only the edge agent DaemonSet would have the corresponding toleration.
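
Here's a sketch of that pattern; the dedicated=edge-processing key/value is illustrative, not our actual taint. First, taint (and label) the designated nodes:

kubectl taint nodes <node-name> dedicated=edge-processing:NoSchedule
kubectl label node <node-name> dedicated=edge-processing

Then give only the edge agent's DaemonSet the matching toleration, plus a nodeSelector so it lands only on those nodes:

spec:
  template:
    spec:
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "edge-processing"
          effect: "NoSchedule"
      nodeSelector:
        dedicated: edge-processing
      containers:
        # ... rest of the edge agent definition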

The Observability Story: When Agents Go Rogue

Scaling agents isn’t just about deploying them; it’s about knowing what they’re doing and whether they’re healthy. With hundreds of agents, you can’t manually check logs anymore. This is where a robust observability stack becomes non-negotiable.

For Chameleon, we integrated heavily with Prometheus and Grafana. Each agent exposed a `/metrics` endpoint, and a Prometheus instance scraped these metrics (the scrape wiring is sketched just after this list). We built dashboards to monitor:

  • Agent health (up/down status)
  • Resource consumption (CPU, memory per agent)
  • Latency of data reporting
  • Specific metrics gathered by the agents themselves
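
The scrape wiring can be as simple as pod-template annotations. Note that these prometheus.io/* annotations are a community convention, not a Kubernetes built-in; they only take effect if your Prometheus kubernetes_sd_configs relabeling honors them, and the port below is a placeholder:

spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9100"      # Placeholder: wherever your agent serves metrics
        prometheus.io/path: "/metrics"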

Alerting was crucial. We set up alerts for agents that stopped reporting, agents consuming excessive resources, or agents reporting anomalous data. This allowed us to shift from reactive firefighting to proactive management. When an agent went “rogue” (a memory leak, a runaway process), our alerts would fire, and Kubernetes’ self-healing properties often meant the pod would be restarted before we even had to intervene. If it was a systemic issue, the alerts would point us to a problem with the DaemonSet definition itself, allowing for a quick rollback.
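
To give a flavor of those alerts, here's a minimal Prometheus alerting rule, assuming the scrape job is named chameleon-agent; the job name, duration, and severity are illustrative:

groups:
  - name: chameleon-agent
    rules:
      - alert: ChameleonAgentDown
        expr: up{job="chameleon-agent"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Chameleon agent on {{ $labels.instance }} has stopped reporting"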

Actionable Takeaways for Your Agent Scaling Journey

So, you’re ready to scale your agents? Here’s my distilled advice:

  1. Embrace the Cattle Mindset Early: Design your agents to be stateless or to handle ephemeral storage gracefully. Assume they can disappear and reappear at any moment.
  2. Start with DaemonSets for Node-Level Agents: They are your best friend for ensuring an agent runs on every relevant node.
  3. Resource Limits and Requests are Your Shield: Define them for your agent pods to prevent resource contention and ensure stability. Your cluster will thank you.
  4. Leverage ConfigMaps and Secrets: Externalize your agent configuration and sensitive data. Never hardcode.
  5. Master Rolling Updates and Rollbacks: Practice these. Your ability to quickly update and revert is paramount for sanity in a scaled environment.
  6. Use NodeSelectors and Taints/Tolerations for Granular Control: Don’t just deploy everywhere if you don’t need to. Target your agents precisely.
  7. Build a Robust Observability Stack: You can’t manage what you can’t see. Metrics, logs, and alerts are non-negotiable for hundreds of agents.
  8. Automate Everything Possible: From image builds (CI/CD pipelines!) to DaemonSet deployments, automation reduces human error and speeds up your iteration cycles.

Scaling agents from a handful of VMs to a large Kubernetes cluster feels like a giant leap, but it’s a necessary one if your agents are providing real value. It forces you to think about robustness, automation, and observability in ways that manual management never would. It was a steep learning curve for me and Chameleon, but standing here today, with hundreds of agents humming along happily, I wouldn’t have it any other way. The initial pain is absolutely worth the long-term gain in stability, maintainability, and peace of mind.

Got your own agent scaling stories or Kubernetes tips? Drop them in the comments! I’d love to hear how you’re tackling these challenges.
