Monitoring and Alerts: Email, Slack, Discord, Telegram, Webhooks

Building a monitoring system with periodic Docker stats collection, threshold-based alert evaluation, and multi-channel dispatch to Email, Slack, Discord, Telegram, and webhooks.

Juste A. Gnimavo (Thales) & Claude | March 26, 2026 | 9 min read

A deployment platform without monitoring is a deployment platform where problems are discovered by users. Your customers notice the outage before you do. Your Slack channel fills with complaints while your dashboard shows everything green. You SSH in, check Docker stats, and realize a container has been OOM-killed for the past three hours.

We built monitoring and alerting across three phases of sh0's development. Phase 10 established the metric collection pipeline and alert evaluation engine. Phase 17 added multi-channel alert dispatch -- email, Slack, Discord, Telegram, and custom webhooks. A later refactor session rebuilt the monitoring dashboard with real-time sparkline charts and per-app resource breakdowns.

This is the full story: from Docker stats to a Telegram message that tells you your app is down.

Phase 10: The Metric Collection Pipeline

The monitoring system had three background tasks, each running on its own interval:

  • MetricCollector -- every 10 seconds, collects CPU, memory, and network stats from all managed containers
  • AlertEvaluator -- every 30 seconds, compares latest metrics against configured thresholds
  • MetricPruner -- every hour, deletes metrics older than the retention window
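
The three intervals above interleave on a shared clock. As a minimal, std-only sketch of which tasks come due at a given second, assuming all three start at t = 0 (the `Task` enum and `due_at` function are illustrative names, not sh0's code, which presumably drives each task from its own async timer):

```rust
/// Illustrative task identifiers; hypothetical names, not sh0's own types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Task {
    MetricCollector,
    AlertEvaluator,
    MetricPruner,
}

impl Task {
    /// Interval in seconds, taken from the list above.
    pub fn interval_secs(self) -> u64 {
        match self {
            Task::MetricCollector => 10,
            Task::AlertEvaluator => 30,
            Task::MetricPruner => 3600,
        }
    }
}

/// Returns the tasks that are due `elapsed_secs` after startup,
/// assuming every task first runs at t = 0.
pub fn due_at(elapsed_secs: u64) -> Vec<Task> {
    [Task::MetricCollector, Task::AlertEvaluator, Task::MetricPruner]
        .iter()
        .copied()
        .filter(|t| elapsed_secs % t.interval_secs() == 0)
        .collect()
}
```

Under these assumptions the evaluator always runs on a tick where the collector also runs, so how fresh the "latest" metric is depends on which of the two finishes first.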

Collecting Container Stats

The collector listed all Docker containers with the sh0.managed=true label, then called the Docker stats API for each one:

```rust
use std::sync::Arc;
use anyhow::Result;

pub struct MetricCollector {
    db: Arc<DbPool>,
    docker: Arc<DockerClient>,
}

impl MetricCollector {
    pub async fn collect(&self) -> Result<()> {
        let containers = self.docker
            .list_containers_with_label("sh0.managed=true")
            .await?;

        let mut metrics = Vec::new();

        // Four metrics per managed container
        for container in &containers {
            let stats = self.docker.container_stats(&container.id).await?;

            metrics.push(Metric::new(&container.app_id, "cpu", stats.cpu_percent));
            metrics.push(Metric::new(&container.app_id, "memory", stats.memory_bytes as f64));
            metrics.push(Metric::new(&container.app_id, "network_rx", stats.network_rx as f64));
            metrics.push(Metric::new(&container.app_id, "network_tx", stats.network_tx as f64));
        }

        // Server-level aggregates
        let total_cpu: f64 = metrics.iter()
            .filter(|m| m.metric_type == "cpu")
            .map(|m| m.value)
            .sum();
        let used_memory: f64 = metrics.iter()
            .filter(|m| m.metric_type == "memory")
            .map(|m| m.value)
            .sum();

        // Assumption: `host_memory_limit()` is a hypothetical helper that returns
        // the host's total memory in bytes; the original excerpt elides this step.
        let total_memory: u64 = self.docker.host_memory_limit().await?;
        let memory_percent = if total_memory > 0 {
            used_memory / total_memory as f64 * 100.0
        } else {
            0.0
        };

        metrics.push(Metric::new("server", "server_cpu", total_cpu));
        metrics.push(Metric::new("server", "server_memory_percent", memory_percent));
        metrics.push(Metric::new("server", "server_memory_limit", total_memory as f64));

        // Batch insert in a single transaction
        Metric::insert_batch(&self.db, &metrics).await?;

        Ok(())
    }
}
```

The batch insert was critical for performance. With 20 running containers, each collection cycle produced 80+ metrics (four per container plus the server-level aggregates). Inserted individually, that would mean 80+ database transactions every 10 seconds. A single batch insert needed one, cutting the number of commits per cycle by a factor of roughly 80.
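
To make the arithmetic concrete, here is a small sketch of how a batch could be folded into one multi-row statement. The table name, column names, and `?` placeholder style are assumptions for illustration; the actual `Metric::insert_batch` is not shown in the article.

```rust
/// Metrics produced per cycle: four per container plus the three
/// server-level aggregates shown in the excerpt above.
pub fn metrics_per_cycle(containers: usize) -> usize {
    containers * 4 + 3
}

/// Builds one multi-row INSERT for `rows` metrics.
/// Table and column names are hypothetical, chosen for illustration only.
pub fn build_batch_insert(rows: usize) -> Option<String> {
    if rows == 0 {
        return None; // nothing to insert, so no statement
    }
    let placeholders = vec!["(?, ?, ?)"; rows].join(", ");
    Some(format!(
        "INSERT INTO metrics (app_id, metric_type, value) VALUES {}",
        placeholders
    ))
}
```

With 20 containers, `metrics_per_cycle(20)` gives 83 rows, all carried by a single statement and a single commit.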

The collector also computed server-level aggregates: total CPU utilization across all containers, total memory usage, and the memory limit. These powered the monitoring dashboard's overview cards without requiring the frontend to aggregate raw per-container metrics.

Memory: Bytes vs. Percentages

An early bug in the monitoring dashboard displayed raw memory bytes as percentages. The root cause was straightforward: the collector stored server_memory as bytes, but the frontend template used ${value.toFixed(1)}%.

The fix added two new metric types: server_memory_percent (0-100 scale) and server_memory_limit (total available bytes). The dashboard could then display both the percentage bar and the "2.4 GB / 8.0 GB" subtitle that users actually wanted to see.
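
The relationship between the two metric types and the subtitle can be sketched as follows. The function names are illustrative, and the choice of 1024-based units is an assumption; the article does not say which base the dashboard's `formatBytes()` uses.

```rust
/// Memory usage as a percentage on a 0-100 scale.
/// Returns 0.0 when the limit is zero to avoid dividing by zero.
pub fn memory_percent(used_bytes: u64, limit_bytes: u64) -> f64 {
    if limit_bytes == 0 {
        return 0.0;
    }
    used_bytes as f64 / limit_bytes as f64 * 100.0
}

/// Formats a byte count with 1024-based units and one decimal place.
pub fn format_bytes(bytes: u64) -> String {
    const UNITS: [&str; 5] = ["B", "KB", "MB", "GB", "TB"];
    let mut value = bytes as f64;
    let mut unit = 0;
    while value >= 1024.0 && unit < UNITS.len() - 1 {
        value /= 1024.0;
        unit += 1;
    }
    format!("{:.1} {}", value, UNITS[unit])
}

/// The "used / limit" subtitle shown under the memory bar.
pub fn memory_subtitle(used_bytes: u64, limit_bytes: u64) -> String {
    format!("{} / {}", format_bytes(used_bytes), format_bytes(limit_bytes))
}
```

Keeping the percentage and the raw limit as separate metrics means the frontend never has to guess which unit a value is in, which is exactly the confusion behind the original bug.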

Alert Evaluation

The AlertEvaluator loaded all enabled alert rules and compared them against the latest metrics:

```rust
pub struct AlertEvaluator {
    db: Arc<DbPool>,
    // Needed by the `app_down` check below
    docker: Arc<DockerClient>,
    dispatcher: Option<Arc<AlertDispatcher>>,
}

impl AlertEvaluator {
    pub async fn evaluate(&self) -> Result<()> {
        let alerts = Alert::list_enabled(&self.db).await?;

        for alert in &alerts {
            let should_fire = match alert.alert_type.as_str() {
                "high_cpu" => {
                    let metric = Metric::latest(&self.db, &alert.app_id, "cpu").await?;
                    // No metric recorded yet means nothing to compare, so do not fire
                    metric.map(|m| m.value > alert.threshold).unwrap_or(false)
                }
                "high_memory" => {
                    let metric = Metric::latest(&self.db, &alert.app_id, "memory").await?;
                    metric.map(|m| m.value > alert.threshold).unwrap_or(false)
                }
                "app_down" => {
                    let containers = self.docker
                        .list_containers_for_app(&alert.app_id).await?;
                    // Down when no container exists or none is running
                    containers.is_empty() || containers.iter().all(|c| c.state != "running")
                }
                _ => false,
            };

            if should_fire && !self.in_cooldown(alert) {
                self.fire_alert(alert).await?;
            }
        }

        Ok(())
    }
}
```

Three alert types were supported:

  • high_cpu -- fires when the app's CPU usage exceeds the configured threshold (e.g., 80%)
  • high_memory -- fires when memory usage exceeds the threshold
  • app_down -- fires when no running containers exist for the app

A 5-minute cooldown on each alert prevented alert storms. Without it, a sustained CPU spike would generate an alert every 30 seconds -- flooding the user's inbox and Slack channel and training them to ignore alerts entirely.
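
The `in_cooldown` check is not shown in the excerpt. A minimal sketch of the idea, assuming the last firing time is stored per alert as Unix seconds (the signature and the `notifications_sent` helper are illustrative, not sh0's code):

```rust
/// Cooldown window from the text: five minutes.
pub const COOLDOWN_SECS: u64 = 5 * 60;

/// Returns true while an alert that last fired at `last_fired` (Unix seconds)
/// is still inside its cooldown window at time `now`.
/// `None` means the alert has never fired, so it is not in cooldown.
pub fn in_cooldown(last_fired: Option<u64>, now: u64) -> bool {
    match last_fired {
        Some(fired_at) => now.saturating_sub(fired_at) < COOLDOWN_SECS,
        None => false,
    }
}

/// Counts notifications sent during a breach that lasts `duration_secs`
/// and is evaluated every `eval_interval_secs`.
pub fn notifications_sent(duration_secs: u64, eval_interval_secs: u64) -> u64 {
    let mut last_fired = None;
    let mut sent = 0;
    let mut now = 0;
    while now < duration_secs {
        if !in_cooldown(last_fired, now) {
            last_fired = Some(now);
            sent += 1;
        }
        // Guard against a zero interval, which would never advance the clock
        now += eval_interval_secs.max(1);
    }
    sent
}
```

For a one-hour CPU spike evaluated every 30 seconds, this yields 12 notifications instead of the 120 that would be sent with no cooldown.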

Phase 17: Multi-Channel Alert Dispatch

The alert evaluator could detect problems. It could not tell anyone about them. Phase 17 added the AlertDispatcher -- a routing layer that delivered alert messages to five channels.

The Dispatcher Architecture

```rust
pub struct AlertDispatcher {
    http_client: reqwest::Client,
    smtp_config: Option<SmtpConfig>,
}

pub struct AlertMessage {
    pub alert_type: String,
    pub app_name: String,
    pub threshold: f64,
    pub current_value: f64,
    pub timestamp: DateTime<Utc>,
}

impl AlertDispatcher {
    pub async fn dispatch(
        &self,
        alert: &Alert,
        message: &AlertMessage,
    ) -> Result<()> {
        match alert.channel.as_str() {
            "email"    => self.send_email(alert, message).await,
            "slack"    => self.send_slack(alert, message).await,
            "discord"  => self.send_discord(alert, message).await,
            "telegram" => self.send_telegram(alert, message).await,
            "webhook"  => self.send_webhook(alert, message).await,
            _ => Err(anyhow!("Unknown channel: {}", alert.channel)),
        }
    }
}
```

Each channel had its own delivery module, tailored to the platform's API and formatting requirements.
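
The channel modules call `msg.summary()`, which the excerpts do not show. As a rough sketch of what such a helper might look like (the wording of each summary and the trimmed-down struct are assumptions, not sh0's actual text or type):

```rust
/// Trimmed-down stand-in for the article's `AlertMessage`; the timestamp is
/// omitted so the sketch needs no external crates.
pub struct AlertMessage {
    pub alert_type: String,
    pub app_name: String,
    pub threshold: f64,
    pub current_value: f64,
}

impl AlertMessage {
    /// One-line summary suitable for an email subject or a message header.
    /// The wording here is hypothetical.
    pub fn summary(&self) -> String {
        match self.alert_type.as_str() {
            "high_cpu" => format!(
                "High CPU on {}: {:.1}% (threshold {:.1}%)",
                self.app_name, self.current_value, self.threshold
            ),
            "high_memory" => format!(
                "High memory on {}: {:.1} (threshold {:.1})",
                self.app_name, self.current_value, self.threshold
            ),
            // `app_down` has no meaningful value or threshold to report
            "app_down" => format!("{} is down", self.app_name),
            other => format!("Alert '{}' on {}", other, self.app_name),
        }
    }
}
```

Building the summary once on the message keeps the five channel modules focused on transport and formatting rather than on wording.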

Email: SMTP with HTML

Email alerts used lettre with STARTTLS, sending multipart messages with both HTML and plain-text bodies. Multiple recipients were supported for team notifications:

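```rust
use lettre::message::{MultiPart, SinglePart};
use lettre::transport::smtp::authentication::Credentials;
use lettre::{Message, SmtpTransport, Transport};

async fn send_email(&self, alert: &Alert, msg: &AlertMessage) -> Result<()> {
    let smtp = self.smtp_config.as_ref()
        .ok_or_else(|| anyhow!("SMTP not configured"))?;

    // This excerpt parses a single address from `alert.target`;
    // the multi-recipient handling is not reproduced here.
    let email = Message::builder()
        .from(smtp.from.parse()?)
        .to(alert.target.parse()?)
        .subject(msg.summary())
        .multipart(
            // Plain text first, HTML second: clients pick the richest part they support
            MultiPart::alternative()
                .singlepart(SinglePart::plain(msg.to_plain_text()))
                .singlepart(SinglePart::html(msg.to_html())),
        )?;

    let transport = SmtpTransport::starttls_relay(&smtp.host)?
        .port(smtp.port)
        .credentials(Credentials::new(
            smtp.user.clone(),
            smtp.password.clone(),
        ))
        .build();

    // Note: SmtpTransport is lettre's blocking transport, so this call blocks
    // the current task until the SMTP exchange finishes.
    transport.send(&email)?;
    Ok(())
}
```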

SMTP configuration was passed through six CLI arguments (--smtp-host, --smtp-port, --smtp-user, --smtp-password, --smtp-from, --smtp-tls) with corresponding SH0_SMTP_* environment variable support.
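
The environment-variable side of that configuration could be sketched like this. The exact variable names (the article only gives the `SH0_SMTP_*` prefix), the defaults, and the `smtp_from_env` function are assumptions; the struct fields mirror the `smtp.*` accesses in `send_email`.

```rust
/// SMTP settings; field names mirror the accesses in `send_email`.
#[derive(Debug, PartialEq)]
pub struct SmtpConfig {
    pub host: String,
    pub port: u16,
    pub user: String,
    pub password: String,
    pub from: String,
    pub tls: bool,
}

/// Builds the config from a lookup function, for example
/// `|k| std::env::var(k).ok()`. Returns None when the host or sender is
/// unset, matching the `Option<SmtpConfig>` held by the dispatcher.
/// Variable names and defaults are hypothetical.
pub fn smtp_from_env<F>(get: F) -> Option<SmtpConfig>
where
    F: Fn(&str) -> Option<String>,
{
    let host = get("SH0_SMTP_HOST")?;
    let from = get("SH0_SMTP_FROM")?;
    let port = get("SH0_SMTP_PORT")
        .and_then(|p| p.parse::<u16>().ok())
        .unwrap_or(587); // 587 is the conventional STARTTLS submission port
    let tls = get("SH0_SMTP_TLS")
        .map(|v| v != "false" && v != "0")
        .unwrap_or(true);
    Some(SmtpConfig {
        host,
        port,
        user: get("SH0_SMTP_USER").unwrap_or_default(),
        password: get("SH0_SMTP_PASSWORD").unwrap_or_default(),
        from,
        tls,
    })
}
```

Taking the lookup as a closure keeps the function testable without touching the real process environment.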

Slack: Block Kit

Slack notifications used incoming webhooks with Block Kit formatting for rich, structured alerts:

```rust
use serde_json::json;
use std::time::Duration;

async fn send_slack(&self, alert: &Alert, msg: &AlertMessage) -> Result<()> {
    let payload = json!({
        "blocks": [
            {
                "type": "header",
                "text": { "type": "plain_text", "text": msg.summary() }
            },
            {
                "type": "section",
                "fields": [
                    { "type": "mrkdwn", "text": format!("*App:* {}", msg.app_name) },
                    { "type": "mrkdwn", "text": format!("*Type:* {}", msg.alert_type) },
                    { "type": "mrkdwn", "text": format!("*Value:* {:.1}%", msg.current_value) },
                    { "type": "mrkdwn", "text": format!("*Threshold:* {:.1}%", msg.threshold) },
                ]
            }
        ],
        "attachments": [{
            "color": "#ff0000",
            "text": format!("Triggered at {}", msg.timestamp.format("%Y-%m-%d %H:%M:%S UTC"))
        }]
    });

    // `alert.target` holds the incoming-webhook URL
    self.http_client.post(&alert.target)
        .json(&payload)
        .timeout(Duration::from_secs(15))
        .send().await?
        // reqwest does not treat 4xx/5xx responses as errors on its own
        .error_for_status()?;

    Ok(())
}
```

Discord: Embeds

Discord webhooks used embed objects with color-coded severity, field-based layouts, and ISO 8601 timestamps:

```rust
async fn send_discord(&self, alert: &Alert, msg: &AlertMessage) -> Result<()> {
    let payload = json!({
        "embeds": [{
            "title": msg.summary(),
            "color": 0xff0000, // Red
            "fields": [
                { "name": "Application", "value": &msg.app_name, "inline": true },
                { "name": "Current Value", "value": format!("{:.1}%", msg.current_value), "inline": true },
                { "name": "Threshold", "value": format!("{:.1}%", msg.threshold), "inline": true },
            ],
            // RFC 3339 output is a valid ISO 8601 timestamp
            "timestamp": msg.timestamp.to_rfc3339(),
        }]
    });

    self.http_client.post(&alert.target)
        .json(&payload)
        .send().await?
        // Surface rejected deliveries (4xx/5xx) as errors
        .error_for_status()?;

    Ok(())
}
```

Telegram: MarkdownV2

Telegram's Bot API required MarkdownV2 formatting with aggressive character escaping. Special characters (_, *, [, ], (, ), ~, `, >, #, +, -, =, |, {, }, ., !) all needed backslash escaping:

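```rust
async fn send_telegram(&self, alert: &Alert, msg: &AlertMessage) -> Result<()> {
    // Formatted numbers contain '.', which MarkdownV2 also reserves,
    // so they are escaped along with the text fields.
    let value = escape_markdown(&format!("{:.1}%", msg.current_value));
    let threshold = escape_markdown(&format!("{:.1}%", msg.threshold));

    let text = format!(
        "*{}*\n\nApp: {}\nType: {}\nValue: {}\nThreshold: {}",
        escape_markdown(&msg.summary()),
        escape_markdown(&msg.app_name),
        escape_markdown(&msg.alert_type),
        value,
        threshold,
    );

    let (bot_token, chat_id) = parse_telegram_target(&alert.target)?;

    self.http_client
        .post(&format!("https://api.telegram.org/bot{}/sendMessage", bot_token))
        .json(&json!({
            "chat_id": chat_id,
            "text": text,
            "parse_mode": "MarkdownV2",
        }))
        .send().await?
        // Telegram answers malformed MarkdownV2 with HTTP 400; surface it as an error
        .error_for_status()?;

    Ok(())
}
```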
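
The `escape_markdown` helper is called above but not shown. A minimal sketch, assuming it simply prefixes every reserved character with a backslash (the implementation is illustrative, not sh0's own):

```rust
/// Escapes the characters Telegram's MarkdownV2 mode treats as reserved,
/// plus the backslash itself, by prefixing each with a backslash.
pub fn escape_markdown(input: &str) -> String {
    const RESERVED: &str = "_*[]()~`>#+-=|{}.!\\";
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        if RESERVED.contains(ch) {
            out.push('\\');
        }
        out.push(ch);
    }
    out
}
```

Note that an alert type such as `high_cpu` and a formatted value such as `85.3%` both contain reserved characters, so every dynamic piece of the message has to pass through the escaper.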

Custom Webhooks: HMAC-SHA256

For users who wanted to integrate with their own systems, the generic webhook channel sent a POST request with a JSON payload and an optional HMAC-SHA256 signature that the receiver could verify:

```rust
use hmac::{Hmac, Mac};
use sha2::Sha256;

async fn send_webhook(&self, alert: &Alert, msg: &AlertMessage) -> Result<()> {
    // Serialize once so the signed bytes are exactly the bytes sent
    let payload = serde_json::to_vec(&json!({
        "alert_type": msg.alert_type,
        "app_name": msg.app_name,
        "threshold": msg.threshold,
        "current_value": msg.current_value,
        "timestamp": msg.timestamp.to_rfc3339(),
    }))?;

    let mut request = self.http_client.post(&alert.target)
        .header("Content-Type", "application/json")
        .body(payload.clone());

    // HMAC signing if secret is configured
    if let Some(ref secret) = alert.webhook_secret {
        let mut mac = Hmac::<Sha256>::new_from_slice(secret.as_bytes())?;
        mac.update(&payload);
        let signature = hex::encode(mac.finalize().into_bytes());
        request = request.header("X-sh0-Signature", format!("sha256={}", signature));
    }

    // Treat 4xx/5xx responses from the receiver as delivery failures
    request.send().await?.error_for_status()?;
    Ok(())
}
```

The X-sh0-Signature header let webhook receivers verify that the request actually came from sh0 and was not forged. The receiver computed the same HMAC over the request body and compared signatures -- the same pattern GitHub uses for its webhook deliveries.
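
On the receiving side, the check has two parts: recompute the HMAC over the raw body with the shared secret, then compare it with the header value. The sketch below covers only the second part using the standard library, and assumes the receiver has already computed `expected_mac` (for instance with the same `hmac` and `sha2` crates the sender uses). All function names are illustrative.

```rust
/// Decodes a hex string into bytes; returns None on malformed input.
fn decode_hex(s: &str) -> Option<Vec<u8>> {
    let bytes = s.as_bytes();
    if bytes.len() % 2 != 0 {
        return None;
    }
    bytes
        .chunks(2)
        .map(|pair| {
            let hi = (pair[0] as char).to_digit(16)?;
            let lo = (pair[1] as char).to_digit(16)?;
            Some((hi * 16 + lo) as u8)
        })
        .collect()
}

/// Compares two byte slices without returning early on the first mismatch,
/// so timing does not reveal how many leading bytes were correct.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}

/// Checks an `X-sh0-Signature` header value against the HMAC the receiver
/// computed itself over the raw request body.
pub fn signature_matches(header_value: &str, expected_mac: &[u8]) -> bool {
    let hex = match header_value.strip_prefix("sha256=") {
        Some(rest) => rest,
        None => return false,
    };
    match decode_hex(hex) {
        Some(received) => constant_time_eq(&received, expected_mac),
        None => false,
    }
}
```

The HMAC must be computed over the body bytes exactly as received, before any JSON parsing or re-serialization, or the signatures will not match.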

Deploy Failure Alerts

Alert dispatch was not limited to metric thresholds. The deployment pipeline was also wired to dispatch alerts when a deploy failed.

When a deployment failed, the pipeline queried all deploy_failed alerts for the app and dispatched each one asynchronously. This meant that a failed deploy could simultaneously trigger a Slack message to the team channel, a Telegram message to the on-call developer, and a webhook to the incident management system.
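
The selection step described above can be sketched as a simple filter. The `Alert` struct here is a stand-in with only the fields the sketch needs, and the `enabled` field and `deploy_failed_alerts` function are assumptions; the article does not show the pipeline's query.

```rust
/// Stand-in for the article's `Alert` model; fields are illustrative.
#[derive(Debug, Clone, PartialEq)]
pub struct Alert {
    pub app_id: String,
    pub alert_type: String,
    pub channel: String,
    pub enabled: bool,
}

/// Selects the enabled `deploy_failed` alerts configured for one app.
pub fn deploy_failed_alerts<'a>(alerts: &'a [Alert], app_id: &str) -> Vec<&'a Alert> {
    alerts
        .iter()
        .filter(|a| a.enabled && a.alert_type == "deploy_failed" && a.app_id == app_id)
        .collect()
}
```

Each selected alert would then be handed to the dispatcher in its own task, so that a slow or failing channel does not hold up the others.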

Test Alerts

The dashboard added a "Test" button on each alert card. Clicking it sent a sample AlertMessage through the configured channel without waiting for a real threshold breach. This let users verify their webhook URLs, Slack channel configurations, and Telegram bot tokens before a real incident.

The Monitoring Dashboard

The monitoring page was rewritten from a simple metric display into a four-tab dashboard:

Overview -- three large metric cards for CPU, Memory, and Disk, each with a percentage display, a color-coded progress bar (green below 60%, yellow from 60% to 80%, red above 80%), and a 60-point sparkline chart with 5-second auto-refresh. Memory showed a "used / limit" subtitle in human-readable bytes.

Apps -- per-application resource breakdown with card-based layout showing CPU %, Memory % and bytes, Network RX/TX. Sortable by CPU or Memory. 10-second auto-refresh. This was the view that answered "which app is consuming all the resources?"

Uptime -- global uptime check list with status indicators (up/down/unknown), URL, HTTP method badge, check interval, and uptime percentage. Expandable inline incident history. Create/delete modals for managing checks.

Alerts -- the existing alert CRUD interface with the test button, channel dropdown (now including Discord and Telegram), and threshold configuration.
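
The two pieces of logic behind the Overview cards, the color thresholds and the 60-point sparkline window, can be sketched as follows. The dashboard itself is frontend code; the sketch is in Rust only for consistency with the rest of the article, and all names are illustrative.

```rust
use std::collections::VecDeque;

#[derive(Debug, PartialEq)]
pub enum BarColor {
    Green,
    Yellow,
    Red,
}

/// Maps a 0-100 percentage to the progress-bar color:
/// green below 60, yellow from 60 to 80, red above 80.
pub fn bar_color(percent: f64) -> BarColor {
    if percent < 60.0 {
        BarColor::Green
    } else if percent <= 80.0 {
        BarColor::Yellow
    } else {
        BarColor::Red
    }
}

/// Fixed-size window of the most recent samples for a sparkline.
pub struct Sparkline {
    points: VecDeque<f64>,
    capacity: usize,
}

impl Sparkline {
    pub fn new(capacity: usize) -> Self {
        Sparkline { points: VecDeque::with_capacity(capacity), capacity }
    }

    /// Appends a sample, dropping the oldest one once the window is full.
    pub fn push(&mut self, value: f64) {
        if self.capacity == 0 {
            return;
        }
        if self.points.len() == self.capacity {
            self.points.pop_front();
        }
        self.points.push_back(value);
    }

    pub fn len(&self) -> usize {
        self.points.len()
    }

    /// The oldest sample still inside the window, if any.
    pub fn oldest(&self) -> Option<f64> {
        self.points.front().copied()
    }
}
```

With 60 points and a 5-second refresh, the sparkline covers the most recent 300 seconds, or five minutes, of history.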

Shared formatting utilities (formatBytes() and formatDuration()) were extracted into a shared module to eliminate the duplication that had spread across five dashboard files.

The Numbers

The monitoring system spanned three Rust crates (sh0-monitor, sh0-api, sh0-db), five dispatch modules, two database models (Metric, Alert), and a complete dashboard page with four tabs. Metric collection ran every 10 seconds, alert evaluation every 30 seconds, metric pruning every hour. Five notification channels covered the platforms where developers actually receive alerts.

The test suite grew by 11 tests in Phase 10 and additional tests in Phase 17. The system was designed to be lightweight -- the collector added minimal overhead (one Docker stats call per container per 10-second cycle) and the evaluator was a simple threshold comparison, not a time-series analysis engine.

For a PaaS, this was the right trade-off. Users who needed Grafana-level observability could deploy Grafana as a one-click template (Article 19). For everyone else, the built-in monitoring answered the two questions that matter most: "Is my app running?" and "Is it running out of resources?"


This concludes the monitoring and alerting deep-dive in the "How We Built sh0.dev" series. The full series covers the 14-day journey from an empty Cargo workspace to a production PaaS -- built by a CEO in Abidjan and an AI CTO, with zero human engineers.
