Building a Production-Grade Async OpenAI Status Tracker

A deep dive into the architecture, concurrency model, and lessons learned building a 24/7 incident monitoring service.


Software developer with a strong foundation in React, Node.js, PostgreSQL, and AI-driven applications. Experienced in remote sensing, satellite image analysis, and vector databases. Passionate about defense tech, space applications, and problem-solving. Currently building AI-powered solutions and preparing for a future in special forces.

1. Problem Statement & Goals

The Problem

OpenAI's status page (status.openai.com) publishes incident information via RSS feeds. There is no official webhook, push notification, or programmatic API to track incidents in real-time. The only way to stay informed is to manually check the page or poll the RSS feed yourself.

What We Want

  • Detect new incidents automatically — without manual checking.

  • Track the full lifecycle of each incident: Investigating → Identified → Monitoring → Resolved.

  • Persist complete history — every status update, message, and affected component.

  • Stop wasting resources — once an incident resolves, stop polling it.

  • Survive restarts — if the process crashes, resume tracking active incidents from the database.

  • Expose the data — via a REST API so other systems or dashboards can consume it.

Non-Goals

  • We do not send alerts or notifications (no email, Slack, PagerDuty — that's a consumer's job).

  • We do not scrape the HTML status page — RSS feeds only.

  • We do not provide a UI — just an API.

  • We do not support write operations — read-only API.

2. High-Level Architecture

Key architectural insight: The background monitor and the HTTP API run in the same process, sharing the same asyncpg connection pool. This avoids the complexity of inter-process communication, shared state, or a message queue — Python's asyncio event loop handles both concurrently.

3. Thinking Process & Key Decisions

3.1 Why a single process instead of separate monitor + API processes?

For a single-provider monitoring system at this scale, the added complexity of a message broker (Redis, RabbitMQ) is pure overhead. A single asyncio process can handle hundreds of concurrent tracker tasks and serve HTTP requests simultaneously. The result is simple and deployable to a single Render instance at zero extra cost.

3.2 Why asyncio instead of threading?

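The workload is I/O-bound: almost all time is spent waiting for HTTP responses or database queries. asyncio coroutines are far more memory-efficient than threads, and asyncio.create_task() lets many of those waits overlap on a single thread. This is cooperative concurrency rather than parallelism, so the GIL is not a bottleneck and there are no thread-safety locks to manage.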
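To make the point about overlapping waits concrete, here is a minimal sketch. The `fake_io` coroutine and the tracker names are hypothetical, and `asyncio.sleep` stands in for a real HTTP request or database query.

```python
import asyncio
import time


async def fake_io(name: str, delay: float) -> str:
    # asyncio.sleep stands in for an HTTP request or a database query.
    await asyncio.sleep(delay)
    return name


async def main() -> tuple[list[str], float]:
    start = time.perf_counter()
    # create_task schedules each coroutine on the event loop right away,
    # so the three waits overlap instead of running back to back.
    tasks = [asyncio.create_task(fake_io(f"tracker-{i}", 0.2)) for i in range(3)]
    results = await asyncio.gather(*tasks)
    return results, time.perf_counter() - start


results, elapsed = asyncio.run(main())
print(results, f"{elapsed:.2f}s")
```

Three 0.2-second waits finish in roughly 0.2 seconds of wall-clock time rather than 0.6, all on one thread.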

3.3 The Asymmetric Polling Strategy

We use two different feed levels to balance speed and resource usage:

  • Global feed (5-min interval): Our "discovery" mechanism. We learn about new incidents here.

  • Per-incident feed (5-sec interval): Once an incident is active, we track it in near-real-time to catch every lifecycle update.
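The two intervals above can be sketched as two independent polling loops on the same event loop. This is only an illustration: `poll_forever` and the `scale` parameter are hypothetical, and the loop body records a label where a real tracker would fetch and process a feed.

```python
import asyncio

GLOBAL_INTERVAL = 300.0   # seconds: discovery feed, 5 minutes (from the text)
INCIDENT_INTERVAL = 5.0   # seconds: per-incident feed (from the text)


async def poll_forever(label: str, interval: float, log: list[str], scale: float = 1.0) -> None:
    # `scale` is a hypothetical knob that shrinks the waits so the demo ends quickly.
    while True:
        log.append(label)  # a real loop would fetch and process the feed here
        await asyncio.sleep(interval * scale)


async def demo() -> list[str]:
    log: list[str] = []
    scale = 0.002  # 300 s becomes 0.6 s, 5 s becomes 0.01 s
    tasks = [
        asyncio.create_task(poll_forever("global", GLOBAL_INTERVAL, log, scale)),
        asyncio.create_task(poll_forever("incident", INCIDENT_INTERVAL, log, scale)),
    ]
    await asyncio.sleep(0.1)  # let both loops run briefly
    for task in tasks:
        task.cancel()
    # Collect the CancelledError results so nothing is left pending.
    await asyncio.gather(*tasks, return_exceptions=True)
    return log


poll_log = asyncio.run(demo())
print(poll_log.count("global"), poll_log.count("incident"))
```

In the demo window the global loop fires once while the per-incident loop fires several times, which is the asymmetry the strategy relies on.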

3.4 SHA-256 Hash Caching

OpenAI's RSS server does not reliably return ETag or Last-Modified headers. To avoid parsing XML and running DB queries on every 5-second poll, we hash the raw response body. If the hash hasn't changed, we skip all processing.
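A minimal sketch of the hash check, assuming one remembered digest per feed URL. The `BodyHashCache` class and the example URL are hypothetical names used for illustration, not the project's actual code.

```python
import hashlib


class BodyHashCache:
    """Remembers the SHA-256 digest of the last body seen for each feed URL."""

    def __init__(self) -> None:
        self._last: dict[str, str] = {}

    def has_changed(self, url: str, body: bytes) -> bool:
        digest = hashlib.sha256(body).hexdigest()
        if self._last.get(url) == digest:
            return False  # identical bytes: skip XML parsing and DB work
        self._last[url] = digest
        return True


cache = BodyHashCache()
feed_url = "https://example.invalid/feed.rss"  # placeholder URL
first = cache.has_changed(feed_url, b"<rss>v1</rss>")
repeat = cache.has_changed(feed_url, b"<rss>v1</rss>")
updated = cache.has_changed(feed_url, b"<rss>v2</rss>")
print(first, repeat, updated)
```

Hashing the raw bytes is cheap compared with parsing XML and querying the database, so an unchanged poll costs little more than the HTTP round trip.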

4. Component Deep Dive

4.1 FeedClient (client.py)

Handles all HTTP communication. It uses a single shared aiohttp.ClientSession to manage connection pooling and DNS caching. It implements an exponential backoff retry strategy for 5xx errors.
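The retry behaviour might look roughly like the sketch below. To keep it runnable without a network, the HTTP call is injected as a plain coroutine function. `ServerError`, `fetch_with_backoff`, the retry count, and the delays are assumptions for illustration, not the real FeedClient API.

```python
import asyncio
from typing import Awaitable, Callable


class ServerError(Exception):
    """Hypothetical exception standing in for a 5xx response."""


async def fetch_with_backoff(
    fetch: Callable[[], Awaitable[bytes]],
    retries: int = 3,
    base_delay: float = 1.0,
) -> bytes:
    for attempt in range(retries + 1):
        try:
            return await fetch()
        except ServerError:
            if attempt == retries:
                raise  # out of retries: surface the error to the caller
            # Wait 1x, 2x, 4x, ... the base delay before trying again.
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


attempts: list[int] = []


async def flaky_fetch() -> bytes:
    # Fails twice with a simulated 5xx, then succeeds.
    attempts.append(1)
    if len(attempts) < 3:
        raise ServerError("503")
    return b"<rss/>"


body = asyncio.run(fetch_with_backoff(flaky_fetch, base_delay=0.01))
print(body, len(attempts))
```

In a real client the injected function would wrap a request on the shared aiohttp.ClientSession and raise on 5xx status codes.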

4.2 IncidentRepository (repository.py)

The only file that knows SQL. It manages the asyncpg connection pool and uses INSERT ... ON CONFLICT DO NOTHING (the Postgres counterpart of SQLite's INSERT OR IGNORE) to ensure idempotency.
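The idempotent insert can be demonstrated with the standard library. Note the swap: this sketch uses an in-memory SQLite database in place of Postgres so that it runs without a server. The ON CONFLICT ... DO NOTHING clause works the same way in both, but the table is trimmed to two columns and `save_update` is a hypothetical helper.

```python
import sqlite3

# In-memory SQLite stands in for Postgres here (an assumption for runnability).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE incident_updates (update_id TEXT PRIMARY KEY, status TEXT NOT NULL)"
)

# With asyncpg the placeholders would be $1, $2 instead of ?.
INSERT_UPDATE = """
    INSERT INTO incident_updates (update_id, status)
    VALUES (?, ?)
    ON CONFLICT (update_id) DO NOTHING
"""


def save_update(update_id: str, status: str) -> None:
    conn.execute(INSERT_UPDATE, (update_id, status))
    conn.commit()


# Re-parsing the same feed entry twice must not create a second row.
save_update("update-001", "Investigating")
save_update("update-001", "Investigating")
save_update("update-002", "Identified")

row_count = conn.execute("SELECT COUNT(*) FROM incident_updates").fetchone()[0]
print(row_count)
```

Because a duplicate key is silently skipped, the tracker can re-process a whole feed after a restart without checking first which entries are already stored.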

4.3 IncidentTracker (tracker.py)

Monitors a single incident's lifecycle. It sorts RSS entries chronologically to ensure the current_status always reflects the most recent update. It runs as a dedicated asyncio.Task.
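As a sketch of the chronological sort, assume each parsed entry carries an RFC 822 date string, the format RSS uses for pubDate. The entry dictionaries, their keys, and the sample dates below are invented for illustration.

```python
from email.utils import parsedate_to_datetime

# Made-up sample entries, deliberately out of order.
entries = [
    {"status": "Resolved", "published": "Tue, 04 Mar 2025 11:30:00 GMT"},
    {"status": "Investigating", "published": "Tue, 04 Mar 2025 10:00:00 GMT"},
    {"status": "Monitoring", "published": "Tue, 04 Mar 2025 10:45:00 GMT"},
]


def sort_chronologically(items: list[dict]) -> list[dict]:
    # Parse the date strings into datetimes so the sort is by time, not by text.
    return sorted(items, key=lambda e: parsedate_to_datetime(e["published"]))


ordered = sort_chronologically(entries)
current_status = ordered[-1]["status"]  # the newest update wins
print([e["status"] for e in ordered], current_status)
```

Sorting on parsed datetimes matters because RFC 822 strings do not sort correctly as plain text.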

5. Data Flow: The Lifecycle of an Incident

  1. Detection: IncidentManager detects a new ID in the global feed.

  2. Tracking: An IncidentTracker is spawned. It polls the specific incident feed every 5 seconds.

  3. Persistence: New updates are pushed to Supabase.

  4. Resolution: Once "Resolved" is detected, the tracker marks the DB record as inactive and the task exits naturally.

  5. Recovery: On a service restart, the manager queries the DB for any incident where is_active = TRUE and resumes tracking immediately.
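The steps above can be sketched as a single tracker coroutine plus a recovery query. A plain dictionary stands in for the Supabase tables, and `track_incident`, `active_incident_ids`, and the scripted status feed are hypothetical names for illustration.

```python
import asyncio
from typing import Awaitable, Callable


async def track_incident(
    incident_id: str,
    fetch_status: Callable[[str], Awaitable[str]],
    db: dict[str, dict],
    interval: float = 5.0,
) -> None:
    while True:
        status = await fetch_status(incident_id)
        db[incident_id]["current_status"] = status  # persistence step
        if status == "Resolved":
            db[incident_id]["is_active"] = False
            return  # the task exits naturally, no cancellation needed
        await asyncio.sleep(interval)


def active_incident_ids(db: dict[str, dict]) -> list[str]:
    # Recovery step: what a restart would resume tracking.
    return [iid for iid, row in db.items() if row["is_active"]]


async def demo() -> tuple[list[str], dict[str, dict]]:
    script = iter(["Investigating", "Identified", "Monitoring", "Resolved"])

    async def fetch_status(_incident_id: str) -> str:
        return next(script)  # scripted feed instead of a real HTTP poll

    db = {"inc-1": {"current_status": "Investigating", "is_active": True}}
    before = active_incident_ids(db)
    await track_incident("inc-1", fetch_status, db, interval=0.01)
    return before, db


active_before, lifecycle_db = asyncio.run(demo())
print(active_before, lifecycle_db, active_incident_ids(lifecycle_db))
```

Once the record is marked inactive, a restart no longer picks the incident up, so resolved incidents cost nothing to keep in the history.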

6. Database Design

We use two main tables in Supabase:

SQL

-- One row per incident
CREATE TABLE incidents (
    incident_id    TEXT PRIMARY KEY,       -- OpenAI's ULID
    title          TEXT NOT NULL,
    current_status TEXT NOT NULL,
    is_active      BOOLEAN NOT NULL DEFAULT TRUE,
    created_at     TIMESTAMPTZ NOT NULL,
    updated_at     TIMESTAMPTZ NOT NULL
);

-- One row per lifecycle update
CREATE TABLE incident_updates (
    update_id   TEXT PRIMARY KEY,          -- GUID fragment
    incident_id TEXT NOT NULL REFERENCES incidents(incident_id),
    status      TEXT NOT NULL,
    message     TEXT NOT NULL,
    components  JSONB NOT NULL DEFAULT '[]', -- JSONB for flexibility
    timestamp   TEXT NOT NULL
);

7. API Design

The API is built with FastAPI and is read-only.

  • GET /incidents: List all incidents with filtering and pagination.

  • GET /incidents/active: Quick access to ongoing issues.

  • GET /incidents/{id}: Full detail including every lifecycle update.

8. Reliability & Edge Cases

  • Idempotency: Using ULID-based GUIDs as primary keys, together with ON CONFLICT DO NOTHING inserts, prevents duplicate entries even if the feed is re-parsed.

  • Memory Safety: A callback registered with Task.add_done_callback() removes each tracker task from the manager's registry once it finishes, so resolved incidents do not accumulate in memory.

  • Cold Start: On the first run with an empty database, the system populates itself from the last 50 incidents in the global feed within seconds.
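The memory-safety point can be illustrated with a small task registry. This is a sketch under assumptions: `TrackerRegistry` is a hypothetical name, and `asyncio.sleep` stands in for a tracker coroutine that returns once its incident resolves.

```python
import asyncio
from typing import Any, Coroutine


class TrackerRegistry:
    """Keeps one running task per incident and forgets it when it finishes."""

    def __init__(self) -> None:
        self.tasks: dict[str, asyncio.Task] = {}

    def spawn(self, incident_id: str, coro: Coroutine[Any, Any, Any]) -> asyncio.Task:
        task = asyncio.create_task(coro)
        self.tasks[incident_id] = task
        # Runs on completion, failure, or cancellation, so the entry never leaks.
        task.add_done_callback(lambda _t: self.tasks.pop(incident_id, None))
        return task


async def demo() -> tuple[int, int]:
    registry = TrackerRegistry()
    task = registry.spawn("inc-1", asyncio.sleep(0.01))  # stand-in tracker
    during = len(registry.tasks)
    await task
    await asyncio.sleep(0)  # give the event loop a turn to run the callback
    return during, len(registry.tasks)


tracked_during, tracked_after = asyncio.run(demo())
print(tracked_during, tracked_after)
```

Holding the task in the dictionary also keeps a strong reference to it while it runs, which the asyncio documentation recommends for tasks created with create_task().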

9. Deployment on Render

The service is deployed on Render's free tier.

  • The Constraint: Free services spin down after 15 minutes of inactivity.

  • The Fix: We use an external uptime monitor to ping the /health endpoint every 5 minutes. This keeps the background tracking process alive 24/7.

10. Future Improvements

  • Webhook Support: Send a POST request to a user-defined URL when an incident status changes.

  • Multi-Provider: Add support for Anthropic, AWS, and GitHub status feeds.

  • WebSocket Stream: Provide a real-time stream of updates for dashboard users.
