Building a Production-Grade Async OpenAI Status Tracker
A deep dive into the architecture, concurrency model, and lessons learned building a 24/7 incident monitoring service.
1. Problem Statement & Goals
The Problem
OpenAI's status page (status.openai.com) publishes incident information via RSS feeds. There is no official webhook, push notification, or programmatic API to track incidents in real-time. The only way to stay informed is to manually check the page or poll the RSS feed yourself.
What We Want
Detect new incidents automatically — without manual checking.
Track the full lifecycle of each incident: Investigating → Identified → Monitoring → Resolved.
Persist complete history: every status update, message, and affected component.
Stop wasting resources — once an incident resolves, stop polling it.
Survive restarts — if the process crashes, resume tracking active incidents from the database.
Expose the data — via a REST API so other systems or dashboards can consume it.
Non-Goals
We do not send alerts or notifications (no email, Slack, PagerDuty — that's a consumer's job).
We do not scrape the HTML status page — RSS feeds only.
We do not provide a UI — just an API.
We do not support write operations — read-only API.
2. High-Level Architecture
Key architectural insight: The background monitor and the HTTP API run in the same process, sharing the same asyncpg connection pool. This avoids the complexity of inter-process communication, shared state, or a message queue — Python's asyncio event loop handles both concurrently.
3. Thinking Process & Key Decisions
3.1 Why a single process instead of separate monitor + API processes?
For a single-provider monitoring system at this scale, a message broker (Redis, RabbitMQ) would be pure overhead. A single asyncio process can run hundreds of concurrent tracker tasks and serve HTTP requests at the same time. The result is simple and deploys to a single Render instance at no extra cost.
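To make the single-process idea concrete, here is a minimal stdlib-only sketch. SharedPool, monitor, and handle_request are hypothetical stand-ins for the asyncpg pool, the background monitor, and an HTTP handler; the real service wires these through FastAPI rather than calling them directly.

```python
import asyncio


class SharedPool:
    """Hypothetical stand-in for the shared asyncpg connection pool."""

    def __init__(self) -> None:
        self.rows: list[str] = []


async def monitor(pool: SharedPool, stop: asyncio.Event, interval: float) -> None:
    """Background task: writes one row per tick until asked to stop."""
    tick = 0
    while not stop.is_set():
        tick += 1
        pool.rows.append(f"incident-{tick}")
        try:
            # Sleep for `interval`, but wake up early if stop is set.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass


async def handle_request(pool: SharedPool) -> list[str]:
    """Stand-in for an HTTP handler: reads through the same pool object."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return list(pool.rows)


async def main() -> list[str]:
    pool = SharedPool()
    stop = asyncio.Event()
    monitor_task = asyncio.create_task(monitor(pool, stop, interval=0.01))
    await asyncio.sleep(0.05)  # let the monitor tick a few times
    snapshot = await handle_request(pool)
    stop.set()
    await monitor_task
    return snapshot
```

Both coroutines touch the same object on the same event loop, which is why no broker or inter-process channel is needed in this design.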
3.2 Why asyncio instead of threading?
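The workload is I/O-bound: almost all time is spent waiting for HTTP responses or database queries. asyncio coroutines use far less memory than threads, and asyncio.create_task() lets many trackers wait on I/O concurrently. Everything runs cooperatively on a single thread, so there is no lock contention and the GIL is not a bottleneck, although any CPU-heavy step would block the event loop.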
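A quick way to see this behaviour is to start many tasks that only wait. In the sketch below, fake_poll is a hypothetical stand-in for one tracker's network wait; the tasks overlap their waits while all running on the same thread.

```python
import asyncio
import threading
import time


async def fake_poll(incident_id: int, delay: float) -> int:
    """Pretend to wait on the network, then report which thread ran us."""
    await asyncio.sleep(delay)
    return threading.get_ident()


async def run_trackers(count: int, delay: float) -> tuple[float, set[int]]:
    """Run `count` fake trackers concurrently; return elapsed time and thread ids."""
    start = time.perf_counter()
    tasks = [asyncio.create_task(fake_poll(i, delay)) for i in range(count)]
    thread_ids = await asyncio.gather(*tasks)
    return time.perf_counter() - start, set(thread_ids)
```

Running 100 trackers with a 0.05-second delay should take roughly one delay rather than 100 of them, and the returned set should contain a single thread id.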
3.3 The Asymmetric Polling Strategy
We use two different feed levels to balance speed and resource usage:
Global feed (5-min interval): Our "discovery" mechanism. We learn about new incidents here.
Per-incident feed (5-sec interval): Once an incident is active, we track it in near-real-time to catch every lifecycle update.
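The two intervals can share one generic loop. This is a sketch under assumptions: poll_loop and its poll_once callback are hypothetical names, and the constants simply mirror the 5-minute and 5-second figures above.

```python
import asyncio
from typing import Awaitable, Callable

GLOBAL_FEED_INTERVAL = 300.0   # seconds: discovery of new incidents
INCIDENT_FEED_INTERVAL = 5.0   # seconds: tracking one active incident


async def poll_loop(poll_once: Callable[[], Awaitable[bool]], interval: float) -> int:
    """Call poll_once every `interval` seconds until it returns False.

    Returns the number of polls performed.
    """
    polls = 0
    while True:
        polls += 1
        keep_going = await poll_once()
        if not keep_going:
            return polls
        await asyncio.sleep(interval)


def polls_per_discovery_window() -> int:
    """Per-incident polls that fit into one global-feed interval."""
    return int(GLOBAL_FEED_INTERVAL // INCIDENT_FEED_INTERVAL)
```

With these numbers, each active incident costs 60 requests for every single request to the global feed, which is why the fast loop should only exist while an incident is open.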
3.4 SHA-256 Hash Caching
OpenAI's RSS server does not reliably return ETag or Last-Modified headers. To avoid parsing XML and running DB queries on every 5-second poll, we hash the raw response body. If the hash hasn't changed, we skip all processing.
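The hash check itself is only a few lines. The sketch below uses a hypothetical BodyHashCache class keyed by feed URL; the actual project may structure this differently.

```python
import hashlib


class BodyHashCache:
    """Remembers the SHA-256 digest of the last body seen for each feed URL."""

    def __init__(self) -> None:
        self._last: dict[str, str] = {}

    def has_changed(self, url: str, body: bytes) -> bool:
        """Return True (and remember the digest) only if the body is new."""
        digest = hashlib.sha256(body).hexdigest()
        if self._last.get(url) == digest:
            return False  # identical bytes: skip XML parsing and DB work
        self._last[url] = digest
        return True
```

A tracker would call has_changed() right after each fetch and return early on False, so an unchanged feed costs one hash and nothing else.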
4. Component Deep Dive
4.1 FeedClient (client.py)
Handles all HTTP communication. It uses a single shared aiohttp.ClientSession to manage connection pooling and DNS caching. It implements an exponential backoff retry strategy for 5xx errors.
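The retry logic can be sketched independently of aiohttp. Here fetch is a hypothetical callable returning a (status, body) pair, and the attempt count and base delay are assumed values, not the project's actual settings.

```python
import asyncio
from typing import Awaitable, Callable


async def fetch_with_backoff(
    fetch: Callable[[], Awaitable[tuple[int, bytes]]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
) -> tuple[int, bytes]:
    """Retry on 5xx responses, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        status, body = await fetch()
        if status < 500:
            return status, body  # success or a client error: do not retry
        if attempt < max_attempts - 1:
            # Waits of base_delay, 2*base_delay, 4*base_delay, ...
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"giving up after {max_attempts} attempts")
```

Only 5xx statuses trigger a retry in this sketch; anything else is handed back to the caller to interpret.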
4.2 IncidentRepository (repository.py)
The only file that knows SQL. It manages the asyncpg connection pool and uses INSERT ... ON CONFLICT DO NOTHING (the Postgres equivalent of SQLite's INSERT OR IGNORE) to ensure idempotency.
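The idempotent insert can be demonstrated without a Postgres server. The sketch below swaps in Python's built-in sqlite3 module as a stand-in for asyncpg, since SQLite 3.24+ accepts the same ON CONFLICT ... DO NOTHING clause. The function names and the reduced column set are assumptions, and with asyncpg the placeholders would be $1, $2, ... instead of ?.

```python
import sqlite3

INSERT_UPDATE = """
INSERT INTO incident_updates (update_id, incident_id, status, message)
VALUES (?, ?, ?, ?)
ON CONFLICT (update_id) DO NOTHING
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a cut-down incident_updates table for the demo."""
    conn.execute(
        "CREATE TABLE incident_updates ("
        "update_id TEXT PRIMARY KEY, incident_id TEXT NOT NULL, "
        "status TEXT NOT NULL, message TEXT NOT NULL)"
    )


def save_update(conn: sqlite3.Connection, update_id: str, incident_id: str,
                status: str, message: str) -> None:
    """Insert an update; a repeated update_id is silently skipped."""
    conn.execute(INSERT_UPDATE, (update_id, incident_id, status, message))


def count_updates(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM incident_updates").fetchone()[0]
```

Saving the same update_id twice leaves a single row, so re-parsing an unchanged feed entry is harmless.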
4.3 IncidentTracker (tracker.py)
Monitors a single incident's lifecycle. It sorts RSS entries chronologically to ensure the current_status always reflects the most recent update. It runs as a dedicated asyncio.Task.
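Sorting by the parsed date rather than the raw string matters because RSS dates are RFC 822 text. A minimal sketch, assuming each entry is a dict with hypothetical "published" and "status" keys:

```python
from email.utils import parsedate_to_datetime


def sort_entries(entries: list[dict[str, str]]) -> list[dict[str, str]]:
    """Return entries ordered oldest to newest by their RFC 822 date."""
    return sorted(entries, key=lambda e: parsedate_to_datetime(e["published"]))


def current_status(entries: list[dict[str, str]]) -> str:
    """The status of the chronologically latest entry."""
    return sort_entries(entries)[-1]["status"]
```

Because parsedate_to_datetime returns timezone-aware datetimes when an offset is present, entries published with different UTC offsets still compare correctly.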
5. Data Flow: The Lifecycle of an Incident
Detection: IncidentManager detects a new ID in the global feed.
Tracking: An IncidentTracker is spawned. It polls the specific incident feed every 5 seconds.
Persistence: New updates are pushed to Supabase.
Resolution: Once "Resolved" is detected, the tracker marks the DB record as inactive and the task exits naturally.
Recovery: On a service restart, the manager queries the DB for any incident where is_active = TRUE and resumes tracking immediately.
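The resolution and recovery steps can be sketched with an in-memory stand-in for the database. FakeRepository, track, and resume_active are hypothetical names, and the feed argument replaces real per-incident polling with a fixed list of statuses.

```python
import asyncio


class FakeRepository:
    """In-memory stand-in for the database (hypothetical interface)."""

    def __init__(self, rows: dict[str, bool]) -> None:
        self.rows = rows  # incident_id -> is_active

    async def active_incident_ids(self) -> list[str]:
        return [i for i, active in self.rows.items() if active]

    async def mark_resolved(self, incident_id: str) -> None:
        self.rows[incident_id] = False


async def track(repo: FakeRepository, incident_id: str, statuses: list[str]) -> None:
    """Walk through the statuses; mark inactive and exit on 'Resolved'."""
    for status in statuses:
        await asyncio.sleep(0)  # stands in for one poll of the incident feed
        if status == "Resolved":
            await repo.mark_resolved(incident_id)
            return


async def resume_active(repo: FakeRepository, feed: dict[str, list[str]]) -> list[str]:
    """On startup, spawn one tracker per incident still flagged active."""
    ids = await repo.active_incident_ids()
    tasks = [asyncio.create_task(track(repo, i, feed[i])) for i in ids]
    await asyncio.gather(*tasks)
    return ids
```

Incidents already marked inactive are never picked up again, which is what keeps a restart from re-polling resolved incidents.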
6. Database Design
We use two main tables in Supabase:
SQL
-- One row per incident
CREATE TABLE incidents (
    incident_id    TEXT PRIMARY KEY,              -- OpenAI's ULID
    title          TEXT NOT NULL,
    current_status TEXT NOT NULL,
    is_active      BOOLEAN NOT NULL DEFAULT TRUE,
    created_at     TIMESTAMPTZ NOT NULL,
    updated_at     TIMESTAMPTZ NOT NULL
);

-- One row per lifecycle update
CREATE TABLE incident_updates (
    update_id   TEXT PRIMARY KEY,                 -- GUID fragment
    incident_id TEXT NOT NULL REFERENCES incidents(incident_id),
    status      TEXT NOT NULL,
    message     TEXT NOT NULL,
    components  JSONB NOT NULL DEFAULT '[]',      -- JSONB for flexibility
    timestamp   TEXT NOT NULL
);
7. API Design
The API is built with FastAPI and is read-only.
GET /incidents: List all incidents with filtering and pagination.
GET /incidents/active: Quick access to ongoing issues.
GET /incidents/{id}: Full detail including every lifecycle update.
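The filtering and pagination behind GET /incidents could look roughly like the pure function below. This is only an illustration: list_incidents, its parameters, and the response shape are assumptions, and the real service would most likely push the filter and LIMIT/OFFSET into SQL.

```python
from typing import Optional


def list_incidents(
    rows: list[dict],
    status: Optional[str] = None,
    limit: int = 20,
    offset: int = 0,
) -> dict:
    """Filter by current_status (if given), then return one page of results."""
    if limit < 1 or offset < 0:
        raise ValueError("limit must be >= 1 and offset must be >= 0")
    matching = [r for r in rows if status is None or r["current_status"] == status]
    return {"total": len(matching), "items": matching[offset:offset + limit]}
```

Returning the total alongside the page lets a client compute how many pages remain without a second request.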
8. Reliability & Edge Cases
Idempotency: Using ULID-based GUIDs as primary keys prevents duplicate entries, even if the feed is re-parsed.
Memory Safety: The done_callback pattern ensures that once an incident is resolved, its tracker task is removed from memory.
Cold Start: On the first run with an empty database, the system populates itself from the last 50 incidents in the global feed within seconds.
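The done-callback cleanup can be sketched as a small registry. TrackerRegistry and spawn are hypothetical names; the only asyncio calls used are asyncio.create_task() and Task.add_done_callback().

```python
import asyncio
from typing import Any, Callable, Coroutine


class TrackerRegistry:
    """Keeps one task per incident and forgets it when the task finishes."""

    def __init__(self) -> None:
        self.tasks: dict[str, asyncio.Task] = {}

    def spawn(
        self,
        incident_id: str,
        make_coro: Callable[[], Coroutine[Any, Any, Any]],
    ) -> asyncio.Task:
        """Start a tracker unless one is already running for this incident."""
        existing = self.tasks.get(incident_id)
        if existing is not None:
            return existing
        task = asyncio.create_task(make_coro())
        self.tasks[incident_id] = task
        # Runs on completion, error, or cancellation: drop our reference.
        task.add_done_callback(lambda _t: self.tasks.pop(incident_id, None))
        return task
```

The dictionary also serves as the strong reference that keeps a running task from being garbage-collected, since the event loop itself only holds weak references to tasks.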
9. Deployment on Render
The service is deployed on Render's free tier.
The Constraint: Free services spin down after 15 minutes of inactivity.
The Fix: We use an external uptime monitor to ping the /health endpoint every 5 minutes. This keeps the background tracking process alive 24/7.
10. Future Improvements
Webhook Support: Send a POST request to a user-defined URL when an incident status changes.
Multi-Provider: Add support for Anthropic, AWS, and GitHub status feeds.
WebSocket Stream: Provide a real-time stream of updates for dashboard users.



