APIs, Integrations, and Automation — Lesson 4
Building Reliable Automations
Learning Objectives
- 1. Identify common automation failure modes and how to prevent them.
- 2. Plan for rate limits, duplicates, retries, and error handling.
- 3. Create monitoring and alerting for automated workflows.
Why automations fail
Automations fail for ordinary reasons: expired credentials, changed field names, invalid data, service outages, duplicate events, rate limits, permission changes, or someone renaming a form field without telling anyone. The challenging part is that automations can fail silently — everything looks fine on the surface while data stops flowing behind the scenes.
The difference between a fragile automation and a reliable one is not whether failures happen but whether failures are detected, reported, and handled. Reliable automations include logging, alerting, retry logic, and human-readable error messages.
Rate limits, duplicates, and retries
Rate limits cap how many API requests can be made in a time period. An integration that works fine during testing might fail during a marketing campaign or seasonal spike when volume increases dramatically. Know the rate limits of every API your automations use and plan for peak volume, not average volume.
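A quick back-of-the-envelope check makes the peak-versus-average point concrete. The sketch below is illustrative only: the function name `rate_limit_utilization` and every number in the usage lines are assumptions, not figures from any particular API.

```python
def rate_limit_utilization(events_per_minute: float,
                           calls_per_event: int,
                           limit_per_minute: int) -> float:
    """Return the fraction of the rate limit consumed (above 1.0 means over the limit)."""
    if limit_per_minute <= 0:
        raise ValueError("limit_per_minute must be positive")
    return events_per_minute * calls_per_event / limit_per_minute


# Hypothetical numbers: each event triggers 2 API calls, and the limit is 100 calls per minute.
average = rate_limit_utilization(events_per_minute=5, calls_per_event=2, limit_per_minute=100)
peak = rate_limit_utilization(events_per_minute=80, calls_per_event=2, limit_per_minute=100)

print(f"average load uses {average:.0%} of the limit")
print(f"peak load uses {peak:.0%} of the limit")
```

With these made-up numbers the automation looks comfortable on an average day and is well over the limit at peak, which is exactly the situation small-volume testing does not reveal.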
Duplicate events are another common problem. A webhook might fire twice for the same event. A form might be submitted twice by an impatient user. Without duplicate protection, automations create duplicate records, send duplicate emails, or process duplicate payments. Build in deduplication using unique identifiers.
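Deduplication by unique identifier can be sketched in a few lines. This is a minimal illustration, assuming each event carries an `"id"` field (a hypothetical name; real webhooks use their own identifier fields) and using an in-memory set where a production workflow would more likely use a database or other persistent store.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional, Set


def process_once(events: Iterable[Dict[str, Any]],
                 handler: Callable[[Dict[str, Any]], None],
                 seen: Optional[Set[str]] = None) -> List[str]:
    """Run handler once per unique event id and return the ids that were handled."""
    seen = set() if seen is None else seen
    handled: List[str] = []
    for event in events:
        event_id = event["id"]  # assumed unique identifier supplied by the sender
        if event_id in seen:
            continue  # duplicate delivery: skip it instead of acting twice
        handler(event)
        # Record the id only after the handler succeeds, so a failed event can be retried.
        seen.add(event_id)
        handled.append(event_id)
    return handled
```

Passing the same `seen` set into later calls keeps duplicates out across batches, not just within one batch.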
Retry logic determines what happens when an action fails temporarily. A good retry strategy waits before retrying, increases the wait time with each attempt, and eventually gives up and alerts a human. Without retries, a brief network hiccup can cause permanent data loss. Without a retry limit, a persistent error can create an infinite loop.
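The retry strategy described above (wait, increase the wait, give up and alert) might look like the following sketch. The names `retry_with_backoff`, `RetriesExhausted`, and `on_give_up` are hypothetical, and the doubling delay is one common choice rather than the only one.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RetriesExhausted(Exception):
    """Raised when an action still fails after the final allowed attempt."""


def retry_with_backoff(action: Callable[[], T],
                       max_attempts: int = 4,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep,
                       on_give_up: Optional[Callable[[Exception], None]] = None) -> T:
    """Call action, doubling the wait after each failure, up to max_attempts tries."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception as exc:
            if attempt == max_attempts:
                # The retry limit prevents an infinite loop; hand the problem to a human.
                if on_give_up is not None:
                    on_give_up(exc)
                raise RetriesExhausted(f"gave up after {max_attempts} attempts") from exc
            # Waits of base_delay, 2x, 4x, ... between successive attempts.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

This sketch retries on any exception for brevity. A real workflow would usually retry only errors that are plausibly temporary, such as timeouts, and fail immediately on errors like invalid data that a retry cannot fix.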
Monitoring and alerting
Every automation that affects customers, revenue, or critical operations should have monitoring. At minimum, this means logging what happened, alerting when failures occur, and providing enough detail for someone to diagnose the problem.
Good automation logs include: what triggered the workflow, what data was involved, what actions were attempted, what succeeded, what failed, and why. Logs should be accessible to the people responsible for the workflow, not buried in a developer tool that nobody checks.
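One way to capture those fields is a single structured record per workflow run. The sketch below uses Python's standard `logging` and `json` modules; the record layout, the function names, and the sample trigger and action names are all assumptions made for illustration.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Any, Dict, List

logger = logging.getLogger("automation")


def build_run_record(trigger: str,
                     data: Dict[str, Any],
                     actions: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Summarize one run: what triggered it, the data involved, and each action's outcome."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "trigger": trigger,
        "data": data,
        "attempted": [a["name"] for a in actions],
        "succeeded": [a["name"] for a in actions if a["ok"]],
        "failed": [
            {"action": a["name"], "reason": a.get("error", "unknown")}
            for a in actions if not a["ok"]
        ],
    }


def log_run(record: Dict[str, Any]) -> None:
    """Write the record as one JSON line, at ERROR level if any action failed."""
    level = logging.ERROR if record["failed"] else logging.INFO
    logger.log(level, json.dumps(record))


# Hypothetical run: one action succeeded and one failed.
record = build_run_record(
    trigger="form_submitted",
    data={"submission_id": "sub_123"},
    actions=[
        {"name": "create_contact", "ok": True},
        {"name": "send_welcome_email", "ok": False, "error": "mail service timeout"},
    ],
)
log_run(record)
```

Because each run produces one self-describing line, the same records can feed a dashboard or a summary report that the workflow's owners actually read.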
Alert fatigue is real. If every minor issue generates an alert, people stop paying attention. Prioritize alerts for automations that affect revenue, customer experience, and regulatory compliance. Less critical automations can use daily summary reports instead of real-time alerts.
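The prioritization rule can be written down explicitly so that it is applied consistently. This is a minimal sketch: the impact labels and the two channel names are hypothetical and would be replaced by whatever categories your organization uses.

```python
from typing import Iterable

# Impact areas that justify interrupting someone right away (assumed labels).
REALTIME_IMPACTS = frozenset({"revenue", "customer_experience", "compliance"})


def alert_channel(impacts: Iterable[str]) -> str:
    """Return "realtime" for failures touching a critical area, else "daily_summary"."""
    return "realtime" if REALTIME_IMPACTS.intersection(impacts) else "daily_summary"
```

For example, `alert_channel(["revenue"])` selects the real-time channel, while a failure tagged only `"internal_reporting"` would wait for the daily summary.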
Testing automations properly
Automation testing should cover more than the happy path. Test with normal data, incomplete data, duplicate events, invalid values, expired credentials, missing permissions, and high-volume scenarios. The goal is to verify not just that the automation works but that it fails gracefully.
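A table of test cases is a simple way to cover unhappy paths alongside the happy one. The sketch below checks a hypothetical validation step; the `email` and `amount` fields, the validation rules, and the case list are assumptions chosen only to show the pattern.

```python
from typing import Any, Dict, List, Tuple


def validate_submission(payload: Dict[str, Any]) -> List[str]:
    """Return human-readable problems with the payload (an empty list means valid)."""
    errors: List[str] = []
    email = payload.get("email")
    if not isinstance(email, str) or "@" not in email:
        errors.append("email is missing or invalid")
    amount = payload.get("amount")
    if isinstance(amount, bool) or not isinstance(amount, (int, float)) or amount <= 0:
        errors.append("amount must be a positive number")
    return errors


# Each case: (description, payload, whether the payload should be accepted).
CASES: List[Tuple[str, Dict[str, Any], bool]] = [
    ("normal data", {"email": "a@example.org", "amount": 25}, True),
    ("incomplete: no email", {"amount": 25}, False),
    ("invalid: negative amount", {"email": "a@example.org", "amount": -5}, False),
    ("invalid: amount sent as text", {"email": "a@example.org", "amount": "25"}, False),
    ("empty payload", {}, False),
]


def run_cases() -> List[str]:
    """Return the descriptions of cases whose outcome differs from the expectation."""
    return [
        name for name, payload, should_pass in CASES
        if (not validate_submission(payload)) != should_pass
    ]


print("unexpected results:", run_cases())
```

Note that the validator returns readable messages instead of raising on bad input, which is one small form of failing gracefully. Credential, permission, and volume scenarios need tests against the real services or realistic stand-ins and are not covered by this sketch.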
After testing, monitor the automation closely for the first week of production use. Real-world data is messier than test data. Edge cases that were not anticipated during testing will appear with real users and real data.
Case Study
The rate limit surprise
Situation
A nonprofit sent its annual fundraising email to 50,000 contacts. The email included a donation link, and each donation triggered an API call to the nonprofit's CRM. Within the first hour, 800 donations came in, and the resulting burst of API calls exceeded the CRM's rate limit. Calls made over the limit were silently dropped, and 200 donors were never recorded in the system.
Analysis
The automation was tested with small volumes and worked perfectly. Nobody checked the CRM API rate limit against the donation volume expected during a major campaign, and rejected calls were neither retried nor reported. A queue that paced API calls to stay below the rate limit would have prevented the data loss.
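A queue that keeps calls under a rate limit can be sketched as follows. This is an assumption-laden illustration, not the nonprofit's actual fix: `send_throttled`, its parameters, and the sliding-window approach are hypothetical, and the real limit would come from the CRM's documentation.

```python
import time
from collections import deque
from typing import Any, Callable, Deque, Iterable, List


def send_throttled(items: Iterable[Any],
                   send: Callable[[Any], Any],
                   max_calls: int,
                   per_seconds: float,
                   clock: Callable[[], float] = time.monotonic,
                   sleep: Callable[[float], None] = time.sleep) -> List[Any]:
    """Send every queued item, never making more than max_calls in any per_seconds window."""
    if max_calls < 1 or per_seconds <= 0:
        raise ValueError("max_calls must be >= 1 and per_seconds must be > 0")
    sent_at: Deque[float] = deque()  # timestamps of calls still inside the window
    results: List[Any] = []
    for item in items:
        while True:
            now = clock()
            # Forget calls that have aged out of the window.
            while sent_at and now - sent_at[0] >= per_seconds:
                sent_at.popleft()
            if len(sent_at) < max_calls:
                break
            # Window is full: wait until the oldest call ages out, then check again.
            sleep(per_seconds - (now - sent_at[0]))
        results.append(send(item))
        sent_at.append(clock())
    return results
```

Items wait in the queue instead of being rejected, so a burst of donations is recorded more slowly but is not lost. In practice the queue would also need to be durable, so that items survive a restart, and `send` would be wrapped in retry logic.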
Takeaway
Always test automations against peak expected volume, not just typical volume. Rate limits that are invisible during normal operations become critical during campaigns and events.
Reflection Questions
- 1. For your most important automation, what would happen if it failed silently for a week? Would anyone notice?
- 2. Do you know the rate limits of the APIs your organization depends on?
Key Takeaways
- ✓ Automations fail for ordinary reasons; the key is detecting and handling failures.
- ✓ Plan for rate limits against peak volume, not average volume.
- ✓ Build duplicate protection, retry logic, and alerting into every critical automation.
- ✓ Test with incomplete data, duplicates, and high volume, not just the happy path.