A deployment platform without monitoring is a deployment platform where problems are discovered by users. Your customers notice the outage before you do. Your Slack channel fills with complaints while your dashboard shows everything green. You SSH in, check Docker stats, and realize a container has been OOM-killed for the past three hours.
We built monitoring and alerting across three phases of sh0's development. Phase 10 established the metric collection pipeline and alert evaluation engine. Phase 17 added multi-channel alert dispatch -- email, Slack, Discord, Telegram, and custom webhooks. A later refactor session rebuilt the monitoring dashboard with real-time sparkline charts and per-app resource breakdowns.
This is the full story: from Docker stats to a Telegram message that tells you your app is down.
Phase 10: The Metric Collection Pipeline
The monitoring system had three background tasks, each running on its own interval:
- MetricCollector -- every 10 seconds, collects CPU, memory, and network stats from all managed containers
- AlertEvaluator -- every 30 seconds, compares latest metrics against configured thresholds
- MetricPruner -- every hour, deletes metrics older than the retention window
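The scheduling pattern behind these three tasks can be sketched in a few lines. This is an illustration rather than the real sh0 code (which runs async tasks): each task is a closure re-run on its own fixed interval.

```rust
use std::thread;
use std::time::Duration;

// Illustrative sketch of the interval-task pattern (the real system is
// async): spawn a dedicated thread that re-runs the task forever.
fn spawn_periodic(interval: Duration, mut task: impl FnMut() + Send + 'static) {
    thread::spawn(move || loop {
        task();
        thread::sleep(interval);
    });
}
```

The collector would then be spawned with a 10-second interval, the evaluator with 30 seconds, and the pruner with one hour.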
Collecting Container Stats
The collector listed all Docker containers with the sh0.managed=true label, then called the Docker stats API for each one:
```rust
pub struct MetricCollector {
    db: Arc<DbPool>,
    docker: Arc<DockerClient>,
}

impl MetricCollector {
    pub async fn collect(&self) -> Result<()> {
        let containers = self.docker
            .list_containers_with_label("sh0.managed=true")
            .await?;

        let mut metrics = Vec::new();
        for container in &containers {
            let stats = self.docker.container_stats(&container.id).await?;
            metrics.push(Metric::new(&container.app_id, "cpu", stats.cpu_percent));
            metrics.push(Metric::new(&container.app_id, "memory", stats.memory_bytes as f64));
            metrics.push(Metric::new(&container.app_id, "network_rx", stats.network_rx as f64));
            metrics.push(Metric::new(&container.app_id, "network_tx", stats.network_tx as f64));
        }

        // Server-level aggregates
        let total_cpu: f64 = metrics.iter()
            .filter(|m| m.metric_type == "cpu")
            .map(|m| m.value)
            .sum();

        metrics.push(Metric::new("server", "server_cpu", total_cpu));
        metrics.push(Metric::new("server", "server_memory_percent", /* computed */));
        metrics.push(Metric::new("server", "server_memory_limit", total_memory as f64));

        // Batch insert in a single transaction
        Metric::insert_batch(&self.db, &metrics).await?;

        Ok(())
    }
}
```
The batch insert was critical for performance. With 20 running containers, each collection cycle produced 80+ metrics. Individual inserts would create 80 database transactions. A single batch insert created one transaction, reducing disk I/O by two orders of magnitude.
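To make the mechanics concrete, here is a hypothetical sketch (not sh0's actual insert_batch, and the table and column names are assumptions) of how a single multi-row INSERT statement can be built, so that 80 metrics become one statement instead of 80:

```rust
// Hypothetical sketch: build one multi-row INSERT so all metrics land in a
// single statement. Placeholders are numbered SQLite-style (?1, ?2, ...).
fn batch_insert_sql(rows: usize) -> String {
    let placeholders: Vec<String> = (0..rows)
        .map(|i| format!("(?{}, ?{}, ?{})", i * 3 + 1, i * 3 + 2, i * 3 + 3))
        .collect();
    format!(
        "INSERT INTO metrics (app_id, metric_type, value) VALUES {}",
        placeholders.join(", ")
    )
}
```

The values for each row are then bound in order before executing the statement inside one transaction.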
The collector also computed server-level aggregates: total CPU utilization across all containers, total memory usage, and the memory limit. These powered the monitoring dashboard's overview cards without requiring the frontend to aggregate raw per-container metrics.
Memory: Bytes vs. Percentages
An early bug in the monitoring dashboard displayed raw memory bytes as percentages. The root cause was straightforward: the collector stored server_memory as bytes, but the frontend template used ${value.toFixed(1)}%.
The fix added two new metric types: server_memory_percent (0-100 scale) and server_memory_limit (total available bytes). The dashboard could then display both the percentage bar and the "2.4 GB / 8.0 GB" subtitle that users actually wanted to see.
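The conversion itself is trivial, which is exactly why the bug was easy to miss. A sketch of the computation behind server_memory_percent (helper name assumed):

```rust
// Hypothetical helper: convert used/limit bytes into the 0-100 scale the
// dashboard's percentage bar expects. Guards against a zero or missing limit.
fn memory_percent(used_bytes: f64, limit_bytes: f64) -> f64 {
    if limit_bytes <= 0.0 {
        0.0
    } else {
        (used_bytes / limit_bytes) * 100.0
    }
}
```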
Alert Evaluation
The AlertEvaluator loaded all enabled alert rules and compared them against the latest metrics:
```rust
pub struct AlertEvaluator {
    db: Arc<DbPool>,
    dispatcher: Option<Arc<AlertDispatcher>>,
}

impl AlertEvaluator {
    pub async fn evaluate(&self) -> Result<()> {
        let alerts = Alert::list_enabled(&self.db).await?;

        for alert in &alerts {
            let should_fire = match alert.alert_type.as_str() {
                "high_cpu" => {
                    let metric = Metric::latest(&self.db, &alert.app_id, "cpu").await?;
                    metric.map(|m| m.value > alert.threshold).unwrap_or(false)
                }
                "high_memory" => {
                    let metric = Metric::latest(&self.db, &alert.app_id, "memory").await?;
                    metric.map(|m| m.value > alert.threshold).unwrap_or(false)
                }
                "app_down" => {
                    let containers = self.docker
                        .list_containers_for_app(&alert.app_id).await?;
                    containers.is_empty() || containers.iter().all(|c| c.state != "running")
                }
                _ => false,
            };

            if should_fire && !self.in_cooldown(alert) {
                self.fire_alert(alert).await?;
            }
        }

        Ok(())
    }
}
```
Three alert types were supported:
- high_cpu -- fires when the app's CPU usage exceeds the configured threshold (e.g., 80%)
- high_memory -- fires when memory usage exceeds the threshold
- app_down -- fires when no running containers exist for the app
The 5-minute cooldown prevented alert storms. Without it, a sustained CPU spike would generate an alert every 30 seconds -- flooding the user's inbox and Slack channel and training them to ignore alerts entirely.
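The cooldown check itself reduces to a timestamp comparison. A sketch with assumed names (the real in_cooldown reads the alert's last-fired timestamp from the database):

```rust
use std::time::{Duration, SystemTime};

const COOLDOWN: Duration = Duration::from_secs(300); // 5 minutes

// An alert is in cooldown if it fired less than COOLDOWN ago. A clock that
// appears to have gone backwards is treated as "still in cooldown" to be safe.
fn in_cooldown(last_fired: Option<SystemTime>, now: SystemTime) -> bool {
    match last_fired {
        None => false, // never fired: free to fire now
        Some(t) => now.duration_since(t).map(|d| d < COOLDOWN).unwrap_or(true),
    }
}
```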
Phase 17: Multi-Channel Alert Dispatch
The alert evaluator could detect problems. It could not tell anyone about them. Phase 17 added the AlertDispatcher -- a routing layer that delivered alert messages to five channels.
The Dispatcher Architecture
```rust
pub struct AlertDispatcher {
    http_client: reqwest::Client,
    smtp_config: Option<SmtpConfig>,
}

pub struct AlertMessage {
    pub alert_type: String,
    pub app_name: String,
    pub threshold: f64,
    pub current_value: f64,
    pub timestamp: DateTime<Utc>,
}

impl AlertDispatcher {
    pub async fn dispatch(
        &self,
        alert: &Alert,
        message: &AlertMessage,
    ) -> Result<()> {
        match alert.channel.as_str() {
            "email" => self.send_email(alert, message).await,
            "slack" => self.send_slack(alert, message).await,
            "discord" => self.send_discord(alert, message).await,
            "telegram" => self.send_telegram(alert, message).await,
            "webhook" => self.send_webhook(alert, message).await,
            _ => Err(anyhow!("Unknown channel: {}", alert.channel)),
        }
    }
}
```
Each channel had its own delivery module, tailored to the platform's API and formatting requirements.
Email: SMTP with HTML
Email alerts used lettre with STARTTLS, sending both HTML and plain text multipart messages. Multiple recipients were supported for team notifications:
```rust
async fn send_email(&self, alert: &Alert, msg: &AlertMessage) -> Result<()> {
    let smtp = self.smtp_config.as_ref()
        .ok_or_else(|| anyhow!("SMTP not configured"))?;

    let email = Message::builder()
        .from(smtp.from.parse()?)
        .to(alert.target.parse()?)
        .subject(msg.summary())
        .multipart(
            MultiPart::alternative()
                .singlepart(SinglePart::plain(msg.to_plain_text()))
                .singlepart(SinglePart::html(msg.to_html())),
        )?;

    let transport = SmtpTransport::starttls_relay(&smtp.host)?
        .port(smtp.port)
        .credentials(Credentials::new(
            smtp.user.clone(),
            smtp.password.clone(),
        ))
        .build();

    transport.send(&email)?;
    Ok(())
}
```
SMTP configuration was passed through six CLI arguments (--smtp-host, --smtp-port, --smtp-user, --smtp-password, --smtp-from, --smtp-tls) with corresponding SH0_SMTP_* environment variable support.
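A sketch of the fallback for one of the six settings, assuming the usual precedence (flag wins over environment variable; the helper name is illustrative):

```rust
use std::env;

// Illustrative fallback: prefer the --smtp-host flag value if given,
// otherwise fall back to the SH0_SMTP_HOST environment variable.
fn resolve_smtp_host(cli_value: Option<String>) -> Option<String> {
    cli_value.or_else(|| env::var("SH0_SMTP_HOST").ok())
}
```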
Slack: Block Kit
Slack notifications used incoming webhooks with Block Kit formatting for rich, structured alerts:
```rust
async fn send_slack(&self, alert: &Alert, msg: &AlertMessage) -> Result<()> {
    let payload = json!({
        "blocks": [
            {
                "type": "header",
                "text": { "type": "plain_text", "text": msg.summary() }
            },
            {
                "type": "section",
                "fields": [
                    { "type": "mrkdwn", "text": format!("*App:* {}", msg.app_name) },
                    { "type": "mrkdwn", "text": format!("*Type:* {}", msg.alert_type) },
                    { "type": "mrkdwn", "text": format!("*Value:* {:.1}%", msg.current_value) },
                    { "type": "mrkdwn", "text": format!("*Threshold:* {:.1}%", msg.threshold) },
                ]
            }
        ],
        "attachments": [{
            "color": "#ff0000",
            "text": format!("Triggered at {}", msg.timestamp.format("%Y-%m-%d %H:%M:%S UTC"))
        }]
    });

    self.http_client.post(&alert.target)
        .json(&payload)
        .timeout(Duration::from_secs(15))
        .send().await?;

    Ok(())
}
```
Discord: Embeds
Discord webhooks used embed objects with color-coded severity, field-based layouts, and ISO 8601 timestamps:
```rust
async fn send_discord(&self, alert: &Alert, msg: &AlertMessage) -> Result<()> {
    let payload = json!({
        "embeds": [{
            "title": msg.summary(),
            "color": 0xff0000, // Red
            "fields": [
                { "name": "Application", "value": &msg.app_name, "inline": true },
                { "name": "Current Value", "value": format!("{:.1}%", msg.current_value), "inline": true },
                { "name": "Threshold", "value": format!("{:.1}%", msg.threshold), "inline": true },
            ],
            "timestamp": msg.timestamp.to_rfc3339(),
        }]
    });

    self.http_client.post(&alert.target)
        .json(&payload)
        .send().await?;

    Ok(())
}
```
Telegram: MarkdownV2
Telegram's Bot API required MarkdownV2 formatting with aggressive character escaping. The special characters `_`, `*`, `[`, `]`, `(`, `)`, `~`, `` ` ``, `>`, `#`, `+`, `-`, `=`, `|`, `{`, `}`, `.`, and `!` all needed backslash escaping:
```rust
async fn send_telegram(&self, alert: &Alert, msg: &AlertMessage) -> Result<()> {
    let text = format!(
        "*{}*\n\nApp: {}\nType: {}\nValue: {:.1}%\nThreshold: {:.1}%",
        escape_markdown(&msg.summary()),
        escape_markdown(&msg.app_name),
        escape_markdown(&msg.alert_type),
        msg.current_value,
        msg.threshold,
    );

    let (bot_token, chat_id) = parse_telegram_target(&alert.target)?;

    self.http_client
        .post(&format!("https://api.telegram.org/bot{}/sendMessage", bot_token))
        .json(&json!({
            "chat_id": chat_id,
            "text": text,
            "parse_mode": "MarkdownV2",
        }))
        .send().await?;

    Ok(())
}
```
Custom Webhooks: HMAC-SHA256
For users who wanted to integrate with their own systems, the generic webhook channel sent a POST request with a JSON payload and optional HMAC-SHA256 signature verification:
```rust
async fn send_webhook(&self, alert: &Alert, msg: &AlertMessage) -> Result<()> {
    let payload = serde_json::to_vec(&json!({
        "alert_type": msg.alert_type,
        "app_name": msg.app_name,
        "threshold": msg.threshold,
        "current_value": msg.current_value,
        "timestamp": msg.timestamp.to_rfc3339(),
    }))?;

    let mut request = self.http_client.post(&alert.target)
        .header("Content-Type", "application/json")
        .body(payload.clone());

    // HMAC signing if secret is configured
    if let Some(ref secret) = alert.webhook_secret {
        let mut mac = Hmac::<Sha256>::new_from_slice(secret.as_bytes())?;
        mac.update(&payload);
        let signature = hex::encode(mac.finalize().into_bytes());
        request = request.header("X-sh0-Signature", signature);
    }

    request.send().await?;
    Ok(())
}
```
The X-sh0-Signature header let webhook receivers verify that the request actually came from sh0 and was not forged. The receiver computed the same HMAC over the request body and compared signatures -- the same pattern GitHub uses for its webhook deliveries.
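On the receiving side, the recomputed digest (e.g. via the hmac and sha2 crates) should be compared against the header value in constant time, so an attacker cannot learn a matching prefix from response timing. The comparison step can be written in pure Rust (a sketch; receiver-side code is not part of sh0 itself):

```rust
// Constant-time byte comparison: OR together the XOR of every byte pair so
// the loop always runs to completion regardless of where a mismatch occurs.
// The early length check is fine here because HMAC-SHA256 digests are fixed-size.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}
```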
Deploy Failure Alerts
Alert dispatch was not limited to metric thresholds. The deployment pipeline was wired to dispatch alerts when a deploy failed.
When a deployment failed, the pipeline queried all deploy_failed alerts for the app and dispatched each one asynchronously. This meant that a failed deploy could simultaneously trigger a Slack message to the team channel, a Telegram message to the on-call developer, and a webhook to the incident management system.
Test Alerts
The dashboard added a "Test" button on each alert card. Clicking it sent a sample AlertMessage through the configured channel without waiting for a real threshold breach. This let users verify their webhook URLs, Slack channel configurations, and Telegram bot tokens before a real incident.
The Monitoring Dashboard
The monitoring page was rewritten from a simple metric display into a four-tab dashboard:
- Overview -- three large metric cards for CPU, Memory, and Disk, each with a percentage display, a color-coded progress bar (green below 60%, yellow between 60-80%, red above 80%), and a 60-point sparkline chart with 5-second auto-refresh. Memory showed a "used / limit" subtitle in human-readable bytes.
- Apps -- per-application resource breakdown with card-based layout showing CPU %, Memory % and bytes, Network RX/TX. Sortable by CPU or Memory. 10-second auto-refresh. This was the view that answered "which app is consuming all the resources?"
- Uptime -- global uptime check list with status indicators (up/down/unknown), URL, HTTP method badge, check interval, and uptime percentage. Expandable inline incident history. Create/delete modals for managing checks.
- Alerts -- the existing alert CRUD interface with the test button, channel dropdown (now including Discord and Telegram), and threshold configuration.
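The Overview tab's progress-bar thresholds reduce to a tiny pure function. The real dashboard does this in frontend code; a Rust rendering for illustration:

```rust
// Map a utilization percentage to the overview bar color:
// green below 60%, yellow from 60-80%, red above 80%.
fn bar_color(percent: f64) -> &'static str {
    if percent < 60.0 {
        "green"
    } else if percent <= 80.0 {
        "yellow"
    } else {
        "red"
    }
}
```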
Shared formatting utilities (formatBytes() and formatDuration()) were extracted into a shared module to eliminate the duplication that had spread across five dashboard files.
The Numbers
The monitoring system spanned three Rust crates (sh0-monitor, sh0-api, sh0-db), five dispatch modules, two database models (Metric, Alert), and a complete dashboard page with four tabs. Metric collection ran every 10 seconds, alert evaluation every 30 seconds, metric pruning every hour. Five notification channels covered the platforms where developers actually receive alerts.
The test suite grew by 11 tests in Phase 10 and additional tests in Phase 17. The system was designed to be lightweight -- the collector added minimal overhead (one Docker stats call per container per 10-second cycle) and the evaluator was a simple threshold comparison, not a time-series analysis engine.
For a PaaS, this was the right trade-off. Users who needed Grafana-level observability could deploy Grafana as a one-click template (Article 19). For everyone else, the built-in monitoring answered the two questions that matter most: "Is my app running?" and "Is it running out of resources?"
---
This concludes the monitoring and alerting deep-dive in the "How We Built sh0.dev" series. The full series covers the 14-day journey from an empty Cargo workspace to a production PaaS -- built by a CEO in Abidjan and an AI CTO, with zero human engineers.