What Running 35 AI Agents Taught Us About Business Automation
10 systems, 90+ agent roles, 13,570 lines of Python. What works, what breaks, and what we would do differently.
Last year, Odisea — the technology lab behind Synaptic — decided to stop advising clients on AI and start using it to run its own operations. Not as a proof of concept. Not as a weekend hackathon. As the actual operating infrastructure for a multi-unit organization spanning podcast production, legal research, sales operations, institutional partnerships, and academic research across six countries.
Twelve months later, we have 90+ agent roles defined across 10 distinct systems, 13,570 lines of production Python, and enough scar tissue to know what works, what fails, and what the consulting industry gets wrong about business automation.
This is what we learned.
The Setup
Odisea runs six business units: a podcast network (La Odisea), a legal technology practice, a DeFi sales operation (Pan.Tech), an open infrastructure lab (ODIL), a research division covering AI governance and Latin American economics, and Synaptic itself. Each unit has its own pipeline, its own stakeholders, and its own operational rhythm.
Rather than hiring a traditional ops team, we built AI agent systems for each unit. Not chatbots. Not copilots. Autonomous systems that execute multi-step workflows, make routing decisions, report outcomes, and escalate only when they hit the boundaries of their authority.
Here’s what’s running in production right now:
Legal Tech Daemon: 10 specialized agents handling a 37-task legal research backlog for Ecuadorian law. Agents include a corpus engineer, product architect, compliance specialist, market researcher, and domain expert. The system runs autonomous sprint cycles with quality gates, content scoring, and a $20/day budget cap. 33 of 37 tasks completed without human intervention.
Penelope: A personal AI agent managing podcast production. Monitors email, drafts replies with Slack-based human approval, searches for guests, manages calendar scheduling, and tracks the pipeline through Notion. 14 tools, polling every 5 minutes.
Pan.Tech Sales Pipeline: 7 specialist agents managing 92+ prospects in a Notion CRM, handling competitive intelligence, meeting tracking, and action items for a DeFi API product.
Research Systems: 6 agents across three research areas (Latin American Dynamism, AI Governance, AI & Crypto) with 16 custom skills and 4-gate quality control: source verification, voice check, adversarial review, publication approval.
Ventures / Founder Factory: The most complex system. Give it a business idea and it generates a complete company structure with ~35 agents across 10 departments, 4-layer memory, infrastructure provisioning for 9 platforms (Cloudflare, DigitalOcean, GitHub, Vercel, HubSpot, PostHog, Resend, Crisp), and an autonomous operating daemon. Synaptic itself was spawned from this factory.
Lesson 1: The Hard Part Isn’t Building the Agent
Building a single AI agent is trivial. Any competent developer can wire up an LLM to a set of API calls in a weekend. The hard part is everything that happens after the demo.
Quality control. Cost management. Error recovery. Context persistence across sessions. Coordination between agents that share a workflow but have different objectives. Graceful degradation when an upstream API goes down. Human escalation that doesn’t become a bottleneck.
Our Legal Daemon went through three major rewrites before it stopped producing garbage. The first version had no quality gates. Agents would generate 2,000-word legal analyses that contained confident-sounding nonsense: correct legal terminology arranged in meaningless patterns. We built 50+ garbage detection patterns (checking for circular definitions, unsupported conclusions, and placeholder language that sounds authoritative) and a content scoring system that rejects anything below a 0.4 quality threshold. We added a retry cap of 3 attempts per task, after which the task gets flagged as blocked rather than endlessly regenerating trash.
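To make the mechanics concrete, here is a minimal sketch of a quality gate in this spirit: pattern-based garbage detection, a scoring threshold, and a retry cap that flags a task as blocked instead of regenerating forever. The specific patterns and scoring weights are illustrative, not the Legal Daemon's actual rules.

```python
import re

# Illustrative garbage patterns: authoritative-sounding filler,
# circular references, and placeholder language.
GARBAGE_PATTERNS = [
    r"\bit is important to note\b",
    r"\bas (?:mentioned|stated) above\b",
    r"\[(?:TODO|PLACEHOLDER|TBD)\]",
]

QUALITY_THRESHOLD = 0.4
MAX_RETRIES = 3

def score_content(text: str) -> float:
    """Return a 0..1 quality score: penalize brevity and garbage patterns."""
    if len(text.split()) < 50:  # too short to be a real analysis
        return 0.0
    hits = sum(bool(re.search(p, text, re.I)) for p in GARBAGE_PATTERNS)
    return max(0.0, 1.0 - 0.25 * hits)

def run_with_quality_gate(task, generate):
    """Retry generation up to MAX_RETRIES, then block rather than loop."""
    for attempt in range(MAX_RETRIES):
        draft = generate(task)
        if score_content(draft) >= QUALITY_THRESHOLD:
            return {"status": "done", "output": draft, "attempts": attempt + 1}
    return {"status": "blocked", "output": None, "attempts": MAX_RETRIES}
```

The important design choice is the terminal state: a task that fails the gate three times becomes a visible "blocked" item for a human, not an infinite API bill.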
The lesson: agent development is 20% building and 80% quality engineering. If your AI consultancy is showing you a demo and calling it a deployment, find a different consultancy.
Lesson 2: Multi-Agent Coordination Is an Organizational Design Problem
When we deployed the Research Systems, we initially had the same agent write analyses and review them. The result was predictable: the agent rubber-stamped its own work. It took us one embarrassing publication of a poorly sourced article to institute a hard rule: the agent that writes can never be the agent that reviews.
This isn’t a technical constraint. It’s an organizational design principle that happens to apply to software. We ended up with a three-stage pipeline (research-analyst drafts, source-reviewer verifies citations, quality-controller runs adversarial review) that mirrors how a well-run research department operates. The agents have separation of concerns not because the framework requires it, but because sloppy handoffs produce sloppy work regardless of whether the worker is human or artificial.
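The writes-never-reviews rule is simple enough to enforce in a few lines. This is a hedged sketch with hypothetical role names matching the three-stage pipeline; the real enforcement mechanism may differ.

```python
PIPELINE = ["research-analyst", "source-reviewer", "quality-controller"]

def assign_reviewer(author: str, available: list[str]) -> str:
    """Pick a reviewer, enforcing that the author never reviews its own work."""
    candidates = [agent for agent in available if agent != author]
    if not candidates:
        # No independent reviewer exists: escalate rather than rubber-stamp.
        raise ValueError("no independent reviewer available; escalate to human")
    return candidates[0]
```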
The Ventures factory takes this further. Each spawned company gets 10 departments with phase-gated activation. During validation, only Strategy, Sales, and Marketing are active. Product and Engineering come online during the build phase. Customer Success and Operations activate at launch. Finance and Talent at scale. We didn’t design this because it’s technically elegant. We designed it because activating all 35 agents from day one creates a coordination nightmare where agents generate work for departments that have no business existing yet.
The lesson: the best multi-agent architectures borrow from organizational theory, not from distributed systems papers. Conway’s Law applies to AI agents just as much as it applies to engineering teams.
Lesson 3: Budget Controls Are Not Optional
Our Legal Daemon has a hard cap of $20 per day in LLM API costs. The Ventures factory tracks token usage per department and halts sprints when daily budget is exceeded. Every system we deploy has cost visibility built into the reporting layer.
This sounds obvious. It is not standard practice in the AI consulting world.
Here’s why it matters: the difference between a useful AI system and a financial liability is often a single runaway loop. An agent that encounters an ambiguous task and retries endlessly can burn through hundreds of dollars in API costs in hours. A multi-agent system where agents trigger each other without dampening can create exponential cost cascades.
We learned this the expensive way when an early version of our research pipeline got into a cycle where the analyst agent kept revising its output based on the reviewer agent’s feedback, each revision triggering a new review, each review generating new revision suggestions. Twelve iterations later, the output was worse than the original draft and we’d spent 40x the expected compute budget.
Now every agent has: a maximum retry count (usually 3), a per-sprint budget limit, and a circuit breaker that flags the task as blocked rather than continuing to spend money. The Ventures factory goes further with a per-department budget allocation that rolls up into a company-wide daily cap.
Lesson 4: Memory Is the Moat
The most underestimated component in our entire stack is the memory system. The Ventures factory uses a 4-layer architecture: episodic memory (SQLite for recording what happened), semantic memory (markdown files for capturing what we know), procedural memory (playbooks for encoding how we do things), and strategic memory (lessons learned plus a decision journal for preserving why we made the choices we made).
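A toy version of that split makes the layering easier to see. This is a hypothetical sketch, not the factory's implementation: episodic events go to SQLite, while the other three layers (markdown files and playbooks in production) are represented here as in-memory structures for brevity.

```python
import json
import sqlite3
import time

class Memory:
    """Four-layer memory sketch: episodic, semantic, procedural, strategic."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")  # episodic: what happened
        self.db.execute("CREATE TABLE events (ts REAL, agent TEXT, event TEXT)")
        self.semantic = {}    # what we know (markdown files in production)
        self.procedural = {}  # how we do things (playbooks)
        self.strategic = []   # why we chose (decision journal)

    def record_event(self, agent: str, event: dict):
        self.db.execute("INSERT INTO events VALUES (?, ?, ?)",
                        (time.time(), agent, json.dumps(event)))

    def record_decision(self, decision: str, rationale: str):
        self.strategic.append({"decision": decision, "why": rationale})

    def recall_events(self, agent: str) -> list[dict]:
        rows = self.db.execute(
            "SELECT event FROM events WHERE agent = ?", (agent,)).fetchall()
        return [json.loads(r[0]) for r in rows]
```

The layer that pays for itself fastest is the strategic one: "why we rejected strategy X" is exactly the context an agent needs to avoid proposing X again.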
Before we built this, every agent sprint started from scratch. Agents would re-research topics they’d already analyzed. They’d make the same mistakes they’d made in previous sprints. They’d propose strategies that had already been tried and rejected.
After implementing persistent memory, agent output quality improved measurably. Not because the agents got smarter, but because they stopped wasting cycles rediscovering what they’d already learned. The research systems now maintain a shared findings index and a published works catalogue. When a research-analyst starts a new analysis, it first checks what the organization already knows about the topic.
For our OpenClaw deployment, we use QMD, a retrieval system that combines BM25 keyword search with vector embeddings and reranking, and that auto-indexes the workspace every 5 minutes. The result is an agent that accumulates institutional knowledge the way a long-tenured employee does, except it never forgets and it can surface relevant context in milliseconds.
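The core move in hybrid retrieval of this kind is merging two independent rankings. One standard way to do it, shown here as a hedged sketch rather than QMD's actual algorithm, is reciprocal rank fusion: each ranking contributes 1/(k + rank) to a document's score, so documents that place well in both keyword and vector search rise to the top.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge rankings: score(doc) = sum over rankings of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical inputs: ordered doc ids from a BM25 pass and a vector pass.
keyword_ranking = ["guest-notes", "pipeline-doc", "old-episode"]
vector_ranking = ["pipeline-doc", "old-episode", "guest-notes"]
merged = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
```

In this toy example "pipeline-doc" wins the merge because it ranks highly in both lists, even though neither ranking put it unambiguously first.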
This has direct implications for consulting: when we deploy agent systems for clients, the value compounds over time. A system deployed for 6 months is meaningfully better than the same system on day one, because it has accumulated context about the client’s business that would take a new hire weeks to absorb.
Lesson 5: The Integration Layer Eats Most of the Calendar
If you asked me to estimate the time breakdown for a typical agent deployment, it would look like this:
- Understanding the client’s existing workflow: 25%
- Building integrations to their tools (Slack, email, CRM, Google Workspace, Notion, custom APIs): 40%
- Agent logic and prompt engineering: 15%
- Quality gates, monitoring, and error handling: 15%
- Testing and handoff: 5%
Forty percent of the work is plumbing. Not because integration is inherently difficult, but because real business tools have quirks, rate limits, authentication flows, and undocumented behaviors that you only discover in production.
Our Notion integration, for example, has two separate authentication paths because one of them intermittently fails with “Invalid refresh token” errors. Our Slack integration routes through two different bot identities (Penelope for personal use, Ulises for operations) because mixing the two creates confusion about who’s saying what. The Google Workspace integration required a full MCP (Model Context Protocol) server configuration with separate OAuth flows for each service.
None of this is glamorous work. It is where most “AI transformation” projects actually stall. The consultancy shows a beautiful demo of an agent answering questions from a test dataset, and then the project dies in a swamp of API authentication issues and data format mismatches.
Lesson 6: Humans in the Loop Need Designed Touch Points
Penelope, our podcast production agent, has a human-in-the-loop system that works like this: when the agent drafts an email reply, it posts the draft to Slack with three buttons (Approve, Reject, Edit). The human reviews the draft and makes a decision. The agent only sends the email after explicit approval.
This works because the interaction is designed around a specific decision at a specific moment. The human doesn’t need to supervise the agent’s research or reasoning. They just need to answer one question: “Is this email ready to send?”
Contrast this with agent systems that expose every intermediate step to human review. We tried that with the Legal Daemon early on. The orchestrator would post each agent’s output to a Slack channel for review before passing it to the next agent. Within two days, the review channel had 200+ unread messages and nobody was reading any of them. The human oversight became a rubber stamp.
The lesson: human oversight works when it’s concentrated at high-stakes decision points and invisible everywhere else. Every approval request that isn’t genuinely important dilutes the ones that are.
Lesson 7: Start With One Agent, Not Ten
The Ventures factory can spawn 35 agents across 10 departments. When we actually deploy for a business, we start with one. A single agent doing one well-defined task within one department.
Our pilot structure reflects this: a Starter engagement ($5K, 2 weeks) deploys one agent with one integration. A Growth engagement ($10K, 4 weeks) expands to 3-5 agents covering an end-to-end workflow. Enterprise ($15K, 6 weeks) spans 2-3 departments.
This isn’t a sales tactic. It’s a lesson from our own experience. When we tried to deploy multiple systems simultaneously, the debugging surface area grew exponentially. When one agent failed, it was hard to tell whether the problem was in the agent’s logic, the integration, the upstream data, or a cascading failure from another agent.
Starting with one agent, getting it to production quality, and then expanding is faster than deploying everything at once and spending weeks debugging interactions between half-built systems.
What This Means for Businesses
The AI consulting market is projected to hit $24.6 billion globally this year, with the LATAM AI market growing at 22% annually toward a projected $34.6 billion by 2034. The agent market specifically (systems that act autonomously, not just respond to prompts) is expected to grow from $7.84 billion to $52.6 billion by 2030.
Most of what’s being sold as “AI transformation” is still slide decks and proof of concepts. The Big Four charge $500K+ for engagements that take 6-18 months. The AI agent platforms sell self-serve tools that require the client to build everything themselves. The freelance AI engineers deliver code without operational infrastructure.
What’s missing is the middle: firms that actually deploy autonomous systems, at mid-market pricing, with the operational maturity to keep them running.
That’s what we built for ourselves. Every system in this article is production code running on real infrastructure, processing real data, producing real business outcomes. The Legal Daemon completed 33 of 37 assigned research tasks. The sales pipeline manages 92+ active prospects. The research systems produce publication-ready analyses with four-gate quality control.
We didn’t build these systems to impress anyone. We built them because we needed them. And the fact that they work, with all the messy, unglamorous quality engineering that “working” requires, is the strongest argument we can make for what AI automation actually looks like when you move past the demo.
Synaptic turns businesses into AI-native organizations. We start where the demo ends. synaptic.so