The part the demo never shows
Getting a recurring AI job started is easy now. Tell an assistant to check your inbox every morning, or wire up a scheduled task, and it works on the first try. The trouble is that “starts” and “keeps running reliably” are two different problems, and the gap between them is invisible until the morning a run silently doesn’t fire. Nothing errors. Nothing tells you. You just quietly stop getting the thing you came to depend on, and you find out only after a customer or your own boss has noticed, or a deadline has already passed.
So the real question isn’t “can I build this?” It’s “what does it take to trust this is still running in six months?” That answer splits into two roads.
Road A — bolt it on with native tools
The native scheduling and automation built into AI assistants and agent platforms is genuinely useful. Zero infrastructure, set up in minutes, nothing to host. For low-stakes work it’s the right answer. But it carries three structural limits you’re accepting whether you notice them or not:
- You often can’t tell if it ran. Many native tools give you no run log and no failure alert. The job fails silently, and by the time you spot the gap, the miss has already happened.
- Nothing guarantees execution. These features aren’t built as production job runners. There’s rarely a retry on failure or a way to backfill a missed run, and the vendor can pause, rate-limit, or change the feature without warning — taking your operation with it.
- They don’t hold at fleet size. One job is fine. Ten of them scattered across someone’s personal account (no central view, no versioning, no handover) is an operational mess, and all of it dies the day that one account or login does.
Road B — run your own infrastructure
The alternative is to stop bolting on and run recurring work on real plumbing. This is more reliable — and it is not free. “Not native” means you now own:
- An orchestrator to run and sequence the jobs — cron at the small end, something like n8n, Make, or a proper workflow engine as you grow.
- Monitoring that proves a job ran and succeeded, and pings you the moment one doesn’t.
- Retries, idempotency, and error handling so a one-off blip doesn’t turn into a silent gap or a double-send.
- Somewhere to keep secrets and logs you can actually read — and someone who owns all of it. Infrastructure isn’t a one-time build; it’s a standing responsibility.
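To make the retries-and-idempotency item concrete, here is a minimal sketch, not a production job runner. The names `run_once`, `run_key`, and `done` are hypothetical, and the in-memory set stands in for whatever durable store (a database row, a file) you would actually use to remember which runs already succeeded.

```python
import time
from typing import Callable, Set


def run_once(run_key: str, job: Callable[[], None], done: Set[str],
             max_attempts: int = 3, base_delay: float = 1.0,
             sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run `job` at most once per `run_key`, retrying transient failures.

    Returns True if the job succeeded during this call, False if the key
    was already recorded as done (so nothing is sent twice). Re-raises
    the last error when every attempt fails, so the failure is loud.
    """
    if run_key in done:          # idempotency: this run already succeeded
        return False
    for attempt in range(1, max_attempts + 1):
        try:
            job()
            done.add(run_key)    # record success only after the job finishes
            return True
        except Exception:
            if attempt == max_attempts:
                raise            # out of retries: surface the error
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    return False                 # only reached if max_attempts < 1
```

A key such as `"daily-brief-2024-01-01"` (one per job per scheduled run) is what stops a retry, or a manual re-run, from double-sending. The check-then-record step here is not atomic, so two overlapping runs would need a lock or a unique constraint in the real store.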
There’s no free lunch
The mistake isn’t picking the wrong road. It’s not realising there was a fork. Choosing native tools is choosing to accept those limits — which is completely fine when you’ve decided it on purpose, for work where a silent miss costs you little. It’s only dangerous when you back into it without knowing you chose. The honest way to decide is to ask what a silently-missed run actually costs you:
| Your situation | Cost of a silent miss | Sensible choice |
|---|---|---|
| A “nice to have” daily brief | Low — you’d notice and catch up | Native tool is fine. Add a heartbeat (below). |
| Customer-facing automation (replies, follow-ups) | Medium — a dropped customer | Native plus monitoring, or light orchestration. |
| A miss means money, compliance, or a court date | High — hard to undo | Own infrastructure: retries, alerting, redundancy. |
| Many jobs across a team (10+) | Compounding — unmanageable sprawl | Central orchestration. Stop scattering jobs across personal accounts. |
The cheapest reliability you can buy
Whichever road you’re on, there’s one move with a return out of all proportion to its cost: a heartbeat, or dead-man’s switch. The job checks in every time it runs; if an expected check-in doesn’t arrive, you get pinged. It’s nearly free (a one-line ping to a monitor) and it converts the worst failure mode, “fails silently,” into a manageable one, “fails loudly.” Even on native tools, this closes the single biggest gap. If you do nothing else from this page, do this.
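As a rough sketch of both halves of a heartbeat, assuming a monitor that accepts a plain HTTP GET as a check-in: the URL and the function names below are hypothetical, and a hosted monitoring service would give you its own endpoint and do the overdue check for you.

```python
import urllib.request
from datetime import datetime, timedelta, timezone
from typing import Callable

# Hypothetical check-in URL; a real monitoring service would issue its own.
HEARTBEAT_URL = "https://monitor.example.com/ping/daily-brief"


def http_ping(url: str, timeout: float = 10.0) -> None:
    """Job side: the one-line check-in, sent as an HTTP GET."""
    urllib.request.urlopen(url, timeout=timeout).close()


def run_with_heartbeat(job: Callable[[], None],
                       ping: Callable[[str], None] = http_ping,
                       url: str = HEARTBEAT_URL) -> None:
    """Run the job, then check in.

    If the job raises, no ping is sent, so the monitor sees a missed
    check-in and can alert.
    """
    job()
    ping(url)


def is_overdue(last_checkin: datetime, interval: timedelta,
               grace: timedelta, now: datetime) -> bool:
    """Monitor side: True when the expected check-in has not arrived in time."""
    return now > last_checkin + interval + grace
```

The ping goes after the job on purpose: a check-in should mean "ran and succeeded," not "started." The grace period is there so a run that finishes a few minutes late doesn't page anyone.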
What 30+ tool evaluations at Meta taught me
At Meta I assessed more than 30 AI platforms and narrowed them down to the 3 we actually kept. Every one of them demoed beautifully; the bolt-on always does. What separated the survivors wasn’t features. It was whether you could see each tool run, trust it to run, and manage it once there were dozens of jobs in flight across hundreds of people. So the evaluation framework I built scored tools on operability, not capability. The clever-but-unobservable ones got cut, because in production, a tool you can’t see is a tool you can’t trust. That’s the lens I bring to client work: not “what can this do in a demo,” but “what will it take to make it hold.”