By Thales & Claude -- CEO & AI CTO, ZeroSuite, Inc.
The scheduler is the only component that matters.
Everything else in 0cron -- the API, the dashboard, the notification system, the NLP parser -- exists in service of one function: executing an HTTP request at the right time. If the scheduler fails, the product fails. If the scheduler is slow, every job runs late. If the scheduler double-fires, users get duplicate webhook calls and corrupted data.
This article is a line-by-line walkthrough of how we built 0cron's scheduling engine in Rust. The full scheduler is 199 lines. The executor is 309 lines. Together, they form the core of a service that needs to be correct, fast, and resilient to concurrent execution across multiple server instances.
---
The Architecture Decision: Why Redis Sorted Sets
Before writing any code, we had to choose a scheduling primitive. The options:
Option 1: Database polling. Query PostgreSQL every N seconds for jobs where next_run_at <= NOW(). Simple, but PostgreSQL is optimized for complex queries, not high-frequency polling. A query every second against a table with millions of rows is wasteful, even with an index.
Option 2: In-memory timer wheel. Store all schedules in a data structure like a priority queue or timing wheel in the application's memory. Fast, but volatile -- a server restart loses all scheduling state. Also incompatible with horizontal scaling: if you run two instances, which one owns which jobs?
Option 3: Redis sorted sets. Each job is a member of a sorted set, scored by its next execution timestamp. Finding due jobs is ZRANGEBYSCORE scheduled_jobs 0 <now> -- an O(log(N) + M) operation where M is the number of due jobs. Redis is designed for exactly this access pattern: high-frequency reads of small, sorted datasets.
We chose Option 3 for three reasons:
1. Speed. Redis sorted set queries are sub-millisecond. Polling every second adds negligible load.
2. Durability. Redis persists to disk (RDB snapshots or AOF). A server restart does not lose the scheduling state.
3. Distribution. Multiple 0cron instances can poll the same Redis sorted set. Combined with distributed locks, this gives us horizontal scaling without a consensus protocol.
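The access pattern is easy to model in miniature. Here is a toy sketch -- plain Rust with a BTreeMap standing in for Redis, all names hypothetical -- of what "scored by next execution timestamp, range-query the due ones" means:

```rust
use std::collections::BTreeMap;

// Toy stand-in for a Redis sorted set: keys are (score, member) pairs.
// Redis keeps members ordered by score; a BTreeMap keyed on
// (timestamp, job_id) gives the same "walk from lowest score to the
// cutoff" access pattern.
struct ToySortedSet {
    entries: BTreeMap<(i64, String), ()>,
}

impl ToySortedSet {
    fn new() -> Self {
        Self { entries: BTreeMap::new() }
    }

    // ZADD: insert a member with its next-run timestamp as the score.
    fn zadd(&mut self, member: &str, score: i64) {
        self.entries.insert((score, member.to_string()), ());
    }

    // ZRANGEBYSCORE 0 now: all members whose score <= now, in score order.
    fn due(&self, now: i64) -> Vec<String> {
        self.entries
            .keys()
            .take_while(|(score, _)| *score <= now)
            .map(|(_, member)| member.clone())
            .collect()
    }
}

fn main() {
    let mut set = ToySortedSet::new();
    set.zadd("job-a", 100); // due at t=100
    set.zadd("job-b", 200); // due at t=200
    set.zadd("job-c", 150); // due at t=150

    // At t=160, job-a and job-c are due, returned in score order.
    assert_eq!(set.due(160), vec!["job-a", "job-c"]);
    println!("due at t=160: {:?}", set.due(160));
}
```

The real sorted set lives in Redis and survives restarts; the point of the sketch is only the shape of the query.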
The trade-off: Redis is an additional infrastructure dependency. For a product that already requires PostgreSQL for persistent storage, this is acceptable. The alternative -- building a distributed scheduling layer from scratch using Raft or Paxos -- would be architectural overkill for a $1.99/month service.
---
The Scheduler Struct
The scheduler is a struct with three fields:
```rust
use chrono::Utc;
use cron::Schedule;
use redis::AsyncCommands;
use sqlx::PgPool;
use std::str::FromStr;
use std::sync::Arc;
use uuid::Uuid;

pub struct Scheduler {
    redis: redis::Client,
    pool: PgPool,
    config: Arc<AppConfig>,
}

impl Scheduler {
    pub fn new(redis: redis::Client, pool: PgPool, config: Arc<AppConfig>) -> Self {
        Self { redis, pool, config }
    }
}
```
redis::Client is a connection factory, not a connection itself. Each operation opens a multiplexed connection from the client. PgPool is SQLx's connection pool -- it manages a set of PostgreSQL connections and hands them out on demand. AppConfig holds runtime configuration (encryption key, server settings) wrapped in Arc for cheap sharing across async tasks.
The scheduler is created in main.rs and wrapped in Arc so it can be moved into the background task:
```rust
// In main.rs
let scheduler = Arc::new(Scheduler::new(
    state.redis.clone(),
    state.db.pool.clone(),
    Arc::clone(&state.config),
));
let _scheduler_handle = scheduler.start();
tracing::info!("Scheduler background task started");
```

The start() method spawns a tokio task and returns a JoinHandle. The handle is stored in _scheduler_handle -- the underscore prefix signals that we intentionally do not await or cancel it. The scheduler runs for the lifetime of the process.
---
The Polling Loop
The start() method is a dozen lines:

```rust
pub fn start(self: Arc<Self>) -> tokio::task::JoinHandle<()> {
    tokio::spawn(async move {
        tracing::info!("Scheduler started");
        loop {
            if let Err(e) = self.poll().await {
                tracing::error!("Scheduler poll error: {e}");
            }
            tokio::time::sleep(std::time::Duration::from_secs(1)).await;
        }
    })
}
```

An infinite loop. Call poll(). Sleep 1 second. Repeat.
If poll() returns an error -- Redis is unreachable, a query fails -- the error is logged and the loop continues. The scheduler does not crash on transient failures. It logs, sleeps, and retries on the next tick. This is a deliberate design choice: a scheduler that dies on the first Redis timeout is useless in production where network blips are routine.
The 1-second sleep determines the system's time resolution. A job scheduled for 14:30:00 will execute between 14:30:00 and 14:30:01 -- at most one second of delay, compared to cron-job.org's 4-to-40-second jitter. For most use cases, sub-second precision is unnecessary. For the rare cases where it matters, the sleep duration is configurable.
---
The Poll Method: Where the Work Happens
This is the core of the scheduler:
```rust
async fn poll(&self) -> AppResult<()> {
    let mut conn = self.redis
        .get_multiplexed_async_connection().await
        .map_err(|e| AppError::Redis(e))?;

    let now = Utc::now().timestamp() as f64;

    // Get all jobs due for execution (score <= now)
    let due_jobs: Vec<String> = conn
        .zrangebyscore("scheduled_jobs", 0.0_f64, now)
        .await?;

    for job_id_str in due_jobs {
        let job_id = match Uuid::parse_str(&job_id_str) {
            Ok(id) => id,
            Err(_) => {
                tracing::warn!("Invalid job ID in scheduled set: {job_id_str}");
                continue;
            }
        };

        // Attempt distributed lock
        let lock_key = format!("lock:{job_id}");
        let locked: bool = redis::cmd("SET")
            .arg(&lock_key)
            .arg("1")
            .arg("NX")
            .arg("EX")
            .arg(60)
            .query_async(&mut conn)
            .await
            .unwrap_or(false);

        if !locked {
            continue; // Another instance holds the lock
        }

        // Remove from sorted set
        let _: () = conn
            .zrem("scheduled_jobs", &job_id_str)
            .await
            .unwrap_or(());

        // Spawn executor task
        let pool = self.pool.clone();
        let config = self.config.clone();
        let redis_client = self.redis.clone();

        tokio::spawn(async move {
            let job = sqlx::query_as::<_, Job>(
                "SELECT * FROM jobs WHERE id = $1"
            )
            .bind(job_id)
            .fetch_optional(&pool)
            .await;

            match job {
                Ok(Some(job)) if job.status == "active" => {
                    // encryption_key is derived from config (elided in this excerpt)
                    if let Err(e) = executor::execute_job(
                        &job, &pool, encryption_key, "schedule", Some(&config)
                    ).await {
                        tracing::error!(job_id = %job.id, "Execution failed: {e}");
                    }

                    // Compute next run and re-schedule
                    if let Ok(next) = compute_next_run(&job.schedule_cron) {
                        let next_ts = next.timestamp() as f64;
                        if let Ok(mut conn) = redis_client
                            .get_multiplexed_async_connection().await
                        {
                            let _: Result<(), _> = conn
                                .zadd("scheduled_jobs", job_id.to_string(), next_ts)
                                .await;
                        }

                        let _ = sqlx::query(
                            "UPDATE jobs SET next_run_at = $1 WHERE id = $2"
                        )
                        .bind(next)
                        .bind(job_id)
                        .execute(&pool)
                        .await;
                    }
                }
                Ok(Some(_)) => {
                    tracing::info!(job_id = %job_id, "Job is not active, skipping");
                }
                Ok(None) => {
                    tracing::warn!(job_id = %job_id, "Job not found in database");
                }
                Err(e) => {
                    tracing::error!(job_id = %job_id, "Failed to fetch job: {e}");
                }
            }
        });
    }

    Ok(())
}
```
Let us walk through this step by step.
Step 1: Query due jobs
```rust
let due_jobs: Vec<String> = conn
    .zrangebyscore("scheduled_jobs", 0.0_f64, now)
    .await?;
```

ZRANGEBYSCORE scheduled_jobs 0 <now> returns all members of the sorted set whose score (next execution time) is less than or equal to the current time. These are the jobs that are due.
The sorted set members are job UUIDs stored as strings. The scores are Unix timestamps as floats. Redis maintains the set sorted by score, so this query is efficient even with hundreds of thousands of jobs -- it walks a skip list from the lowest score to the cutoff point.
Step 2: Acquire distributed lock
```rust
let lock_key = format!("lock:{job_id}");
let locked: bool = redis::cmd("SET")
    .arg(&lock_key)
    .arg("1")
    .arg("NX")
    .arg("EX")
    .arg(60)
    .query_async(&mut conn)
    .await
    .unwrap_or(false);

if !locked {
    continue;
}
```
SET lock:<job_id> 1 NX EX 60 is the standard Redis distributed lock pattern. NX means "set only if the key does not exist" -- an atomic compare-and-set. EX 60 sets a 60-second expiration, which serves as a safety valve: if the instance that acquired the lock crashes mid-execution, the lock automatically releases after one minute.
If locked is false, another 0cron instance already claimed this job. The current instance skips it and moves to the next job in the due list.
This is not Redlock. We are not running Redis Sentinel or a Redis cluster with multiple masters. For 0cron's scale and price point, a single Redis instance with SET NX is sufficient. The failure mode -- two instances executing the same job during a Redis split-brain -- is an acceptable risk at this tier. If 0cron scales to the point where this matters, we upgrade to Redlock or a proper distributed coordination service. Engineering for the current scale, not the hypothetical one.
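The lock's semantics can be modeled without Redis at all. Here is a toy sketch -- an in-memory map, a logical clock, hypothetical names -- of what SET NX EX provides: atomic acquire-if-absent plus a lease that expires:

```rust
use std::collections::HashMap;

// Toy model of `SET key 1 NX EX ttl`: acquisition succeeds only if the key
// is absent or the previous holder's lease has expired. `now` is a logical
// clock in seconds so the sketch stays deterministic.
struct ToyLocks {
    held: HashMap<String, u64>, // key -> lease expiry timestamp
}

impl ToyLocks {
    fn new() -> Self {
        Self { held: HashMap::new() }
    }

    // SET key 1 NX EX ttl, evaluated at logical time `now`.
    fn try_lock(&mut self, key: &str, now: u64, ttl: u64) -> bool {
        match self.held.get(key) {
            // NX: the key exists and the lease is still live -> refuse.
            Some(&expiry) if expiry > now => false,
            // Absent or expired -> acquire and record the new lease (EX).
            _ => {
                self.held.insert(key.to_string(), now + ttl);
                true
            }
        }
    }
}

fn main() {
    let mut locks = ToyLocks::new();

    // Instance A claims the job at t=0 with a 60s lease.
    assert!(locks.try_lock("lock:job-1", 0, 60));
    // Instance B polls at t=1 and is refused: the job is claimed.
    assert!(!locks.try_lock("lock:job-1", 1, 60));
    // Instance A crashed; at t=61 the lease has expired and B can claim it.
    assert!(locks.try_lock("lock:job-1", 61, 60));
    println!("lock semantics hold");
}
```

The real version gets atomicity from Redis executing SET NX as a single command; the sketch only shows why the expiry doubles as crash recovery.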
Step 3: Remove from sorted set and spawn
```rust
let _: () = conn
    .zrem("scheduled_jobs", &job_id_str)
    .await
    .unwrap_or(());
```

After acquiring the lock, the job is removed from the sorted set. This prevents other instances from seeing it on their next poll. The order matters: lock first, then remove. If we removed first and then failed to lock, the job would disappear from the schedule without being executed.
The actual execution is spawned as a separate tokio task. This is critical for throughput: the poll loop must not block on individual job executions. If a job takes 30 seconds to complete (the user's endpoint is slow), the scheduler continues polling and dispatching other jobs in parallel.
Step 4: Execute and re-schedule
After execution, the scheduler computes the next run time from the cron expression and adds the job back to the sorted set with the new timestamp as its score. It also updates next_run_at in PostgreSQL so the dashboard can display the next scheduled execution.
```rust
if let Ok(next) = compute_next_run(&job.schedule_cron) {
    let next_ts = next.timestamp() as f64;
    let _: Result<(), _> = conn
        .zadd("scheduled_jobs", job_id.to_string(), next_ts)
        .await;

    let _ = sqlx::query("UPDATE jobs SET next_run_at = $1 WHERE id = $2")
        .bind(next)
        .bind(job_id)
        .execute(&pool)
        .await;
}
```
The compute_next_run function uses the cron crate to parse the expression and find the next matching time after now:
```rust
pub fn compute_next_run(cron_expr: &str) -> AppResult<chrono::DateTime<Utc>> {
    let schedule = Schedule::from_str(cron_expr)
        .map_err(|e| AppError::Validation(
            format!("Invalid cron expression: {e}")))?;

    schedule
        .upcoming(Utc)
        .next()
        .ok_or_else(|| AppError::Internal(
            "No upcoming schedule found".to_string()))
}
```
The cron crate parses standard cron expressions and generates an iterator of upcoming matching times. .next() returns the soonest one. This computation happens once per execution, not once per tick -- the result is stored in Redis as the new score.
---
The Executor: What Happens When a Job Fires
The scheduler dispatches; the executor runs. The execute_job function in executor.rs handles retries, and execute_single handles the HTTP request lifecycle.
The Retry Loop
```rust
pub async fn execute_job(
    job: &Job,
    db: &PgPool,
    encryption_key: &[u8],
    triggered_by: &str,
    config_app: Option<&AppConfig>,
) -> AppResult<Execution> {
    let config: JobConfig = serde_json::from_value(job.config.clone())?;

    let retry_config = config.retry.clone().unwrap_or(RetryConfig {
        attempts: 1,
        backoff: "exponential".to_string(),
        delay: 5,
    });

    let mut last_execution = None;

    for attempt in 0..retry_config.attempts {
        let execution = execute_single(
            job, &config, db, encryption_key, attempt as i32, triggered_by
        ).await;

        match &execution {
            Ok(exec) if exec.status == "success" => {
                update_job_stats(job.id, exec, db).await?;
                send_job_notifications(job, exec, db, config_app).await;
                return execution;
            }
            Ok(exec) => {
                last_execution = Some(exec.clone());
                if attempt + 1 < retry_config.attempts {
                    let delay = compute_backoff(&retry_config, attempt);
                    tokio::time::sleep(
                        std::time::Duration::from_secs(delay)
                    ).await;
                }
            }
            Err(e) => {
                tracing::error!(job_id = %job.id, "Execution error: {e}");
                if attempt + 1 < retry_config.attempts {
                    let delay = compute_backoff(&retry_config, attempt);
                    tokio::time::sleep(
                        std::time::Duration::from_secs(delay)
                    ).await;
                }
            }
        }
    }

    // All retries exhausted
    if let Some(ref exec) = last_execution {
        update_job_stats(job.id, exec, db).await?;
        send_job_notifications(job, exec, db, config_app).await;
        return Ok(exec.clone());
    }

    Err(AppError::Internal(format!(
        "Job {} failed after all retries", job.id
    )))
}
```
The retry logic defaults to one attempt (no retry) with exponential backoff starting at 5 seconds. Users can configure up to 3 attempts with their choice of backoff strategy.
On success, the loop returns immediately after updating stats and sending notifications. On failure, it logs the attempt, stores the execution record, computes the backoff delay, sleeps, and retries. After all retries are exhausted, it reports the final failure.
Every attempt -- successful or not -- is recorded as an Execution row in PostgreSQL. The user can see every retry in their execution history, complete with the response body and status code from each attempt.
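Stripped of async, SQL, and notifications, the control flow above reduces to a small generic loop. Here is a synchronous sketch (hypothetical names, not the 0cron API; the sleep is injected so the example runs instantly) of the attempt/backoff/return-on-success shape:

```rust
// Run `op` up to `attempts` times, sleeping `backoff(attempt)` seconds
// between failed attempts, returning the first success or the last error.
fn retry<T, E>(
    attempts: u32,
    mut op: impl FnMut(u32) -> Result<T, E>,
    backoff: impl Fn(u32) -> u64,
    mut sleep_secs: impl FnMut(u64),
) -> Result<T, E> {
    let mut last_err = None;
    for attempt in 0..attempts {
        match op(attempt) {
            Ok(v) => return Ok(v), // success short-circuits the loop
            Err(e) => {
                last_err = Some(e);
                if attempt + 1 < attempts {
                    sleep_secs(backoff(attempt)); // only sleep if retrying
                }
            }
        }
    }
    Err(last_err.expect("attempts must be >= 1"))
}

fn main() {
    // An operation that fails twice, then succeeds on the third attempt.
    let mut calls = 0;
    let mut slept = Vec::new();
    let result = retry(
        3,
        |_| {
            calls += 1;
            if calls < 3 { Err("boom") } else { Ok("done") }
        },
        |attempt| 5 * 2u64.pow(attempt), // exponential: 5s, 10s, 20s...
        |d| slept.push(d),               // record instead of actually sleeping
    );
    assert_eq!(result, Ok("done"));
    assert_eq!(slept, vec![5, 10]); // slept after attempts 0 and 1, not after success
    println!("succeeded after {calls} attempts, sleeps: {slept:?}");
}
```

Note the same detail the real code gets right: no sleep after the final failed attempt, because there is nothing left to wait for.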
The Backoff Computation
```rust
fn compute_backoff(retry_config: &RetryConfig, attempt: u32) -> u64 {
    let base = retry_config.delay as u64;
    match retry_config.backoff.as_str() {
        "exponential" => base * 2_u64.pow(attempt),
        "linear" => base * (attempt as u64 + 1),
        "fixed" | _ => base,
    }
}
```

Three strategies:
- Exponential: 5s, 10s, 20s, 40s... Doubles on each retry. Good for transient failures where backing off reduces contention.
- Linear: 5s, 10s, 15s, 20s... Constant increment. Gentler than exponential; useful when the target service has rate limits with predictable reset windows.
- Fixed: 5s, 5s, 5s... Same delay every time. For cases where the failure is either resolved or not, and waiting longer does not help.
The default (the _ catch-all arm) is fixed. If a user provides an unrecognized backoff strategy, they get predictable fixed-delay retries rather than an error.
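The three schedules are easy to verify by running compute_backoff directly. A self-contained sketch -- the function reproduced from above, with RetryConfig reduced to the two fields the function reads:

```rust
// Minimal RetryConfig: only the fields compute_backoff uses.
struct RetryConfig {
    backoff: String,
    delay: i64,
}

fn compute_backoff(retry_config: &RetryConfig, attempt: u32) -> u64 {
    let base = retry_config.delay as u64;
    match retry_config.backoff.as_str() {
        "exponential" => base * 2_u64.pow(attempt),
        "linear" => base * (attempt as u64 + 1),
        "fixed" | _ => base,
    }
}

fn main() {
    // Tabulate the first four delays for each strategy, base delay 5s.
    let delays = |strategy: &str| -> Vec<u64> {
        let cfg = RetryConfig { backoff: strategy.to_string(), delay: 5 };
        (0..4).map(|a| compute_backoff(&cfg, a)).collect()
    };
    assert_eq!(delays("exponential"), vec![5, 10, 20, 40]);
    assert_eq!(delays("linear"), vec![5, 10, 15, 20]);
    assert_eq!(delays("fixed"), vec![5, 5, 5, 5]);
    assert_eq!(delays("garbage"), vec![5, 5, 5, 5]); // unrecognized -> fixed
    println!("backoff tables verified");
}
```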
The HTTP Request Builder
The execute_single function builds and sends the actual HTTP request:
```rust
async fn execute_single(
    job: &Job,
    config: &JobConfig,
    db: &PgPool,
    encryption_key: &[u8],
    attempt: i32,
    triggered_by: &str,
) -> AppResult<Execution> {
    let started_at = Utc::now();
    let client = reqwest::Client::new();

    // Interpolate secrets in URL, headers, and body
    let url = secrets::interpolate_secrets(
        &config.url, job.team_id, db, encryption_key
    ).await?;

    let mut headers = HashMap::new();
    for (k, v) in &config.headers {
        let interpolated = secrets::interpolate_secrets(
            v, job.team_id, db, encryption_key
        ).await?;
        headers.insert(k.clone(), interpolated);
    }

    // Build the request
    let method: reqwest::Method = config.method.parse()?;
    let mut req = client
        .request(method.clone(), &url)
        .timeout(std::time::Duration::from_secs(config.timeout as u64));

    for (k, v) in &headers {
        req = req.header(k.as_str(), v.as_str());
    }

    // `body` is the secret-interpolated request body (interpolation elided above)
    if let Some(ref b) = body {
        req = req.body(b.clone());
    }

    // Execute and capture everything
    let result = req.send().await;
    let finished_at = Utc::now();
    let duration_ms = (finished_at - started_at).num_milliseconds() as i32;

    // ... status extraction, response capture, DB insert
}
```
The secret interpolation step is worth highlighting. Users can store API keys and tokens as encrypted secrets and reference them in job configurations using ${secrets.API_KEY} syntax. Before the HTTP request is sent, every occurrence of ${secrets. in the URL, headers, and body is replaced with the decrypted value. The decryption uses AES-256-GCM with the server's encryption key -- secrets are never stored in plaintext and never logged.
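The substitution itself is plain string work. Here is a toy sketch of the ${secrets.NAME} replacement -- a HashMap stands in for the encrypted store, and interpolate is a hypothetical stand-in for the real interpolate_secrets, which additionally looks the name up in PostgreSQL and decrypts it with AES-256-GCM:

```rust
use std::collections::HashMap;

// Toy ${secrets.NAME} interpolation: replace every occurrence of the token
// with the value from `secrets`, erroring on unknown names or an
// unterminated reference.
fn interpolate(template: &str, secrets: &HashMap<&str, &str>) -> Result<String, String> {
    let mut out = String::new();
    let mut rest = template;
    while let Some(start) = rest.find("${secrets.") {
        out.push_str(&rest[..start]); // copy text before the token
        let after = &rest[start + "${secrets.".len()..];
        let end = after
            .find('}')
            .ok_or_else(|| "unterminated ${secrets. reference".to_string())?;
        let name = &after[..end];
        let value = secrets
            .get(name)
            .ok_or_else(|| format!("unknown secret: {name}"))?;
        out.push_str(value); // substitute the (decrypted) secret value
        rest = &after[end + 1..];
    }
    out.push_str(rest);
    Ok(out)
}

fn main() {
    let secrets = HashMap::from([("API_KEY", "sk-123")]);
    let url = interpolate("https://api.example.com/hook?key=${secrets.API_KEY}", &secrets);
    assert_eq!(url.unwrap(), "https://api.example.com/hook?key=sk-123");
    assert!(interpolate("${secrets.MISSING}", &secrets).is_err());
    println!("interpolation ok");
}
```

The real function also has to avoid ever logging the substituted string, which is why interpolation happens immediately before the request is built and nowhere else.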
The response capture is exhaustive. Every field is recorded:
- Request: URL, method, headers, body
- Response: status code, headers, body (truncated at 1MB)
- Metadata: started_at, finished_at, duration in milliseconds, retry attempt number, triggered_by (schedule, manual, or API)
- Error: if the request failed at the network level, the error message
This means users can debug failed jobs by looking at exactly what 0cron sent and exactly what came back. No guessing, no "it just says failed."
---
Schedule State Management
The scheduler module exports schedule_job and unschedule_job functions used by the job management service. The separation is clean: the scheduler only sees what is in the Redis sorted set; the job service controls what goes in and out.
The ordering is always database first, Redis second:
- Creating a job = INSERT into PostgreSQL + ZADD to Redis
- Pausing a job = UPDATE status in PostgreSQL + ZREM from Redis
- Resuming a job = UPDATE status in PostgreSQL + ZADD to Redis
- Deleting a job = DELETE from PostgreSQL + ZREM from Redis
If the two stores diverge (a crash between the INSERT and ZADD), the job exists in the database but is not scheduled -- a safe failure mode. The user sees their job in the dashboard and can re-save it to re-schedule. The dangerous alternative -- a job scheduled in Redis but missing from PostgreSQL -- never occurs because database writes always come first.
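The invariant can be made concrete with a toy two-store model (hypothetical names: a HashMap for PostgreSQL, a HashSet for the sorted set) showing why a crash between the two writes lands in the safe state:

```rust
use std::collections::{HashMap, HashSet};

// Toy model of the "database first, Redis second" ordering. `db` stands in
// for PostgreSQL, `schedule` for the Redis sorted set (scores dropped for
// brevity). `crash_between` simulates the process dying after the DB write
// but before the Redis write.
struct Stores {
    db: HashMap<String, String>, // job_id -> status
    schedule: HashSet<String>,   // job_ids present in the sorted set
}

impl Stores {
    fn create_job(&mut self, id: &str, crash_between: bool) {
        self.db.insert(id.to_string(), "active".to_string()); // INSERT first
        if crash_between {
            return; // process died before ZADD
        }
        self.schedule.insert(id.to_string()); // ZADD second
    }
}

fn main() {
    let mut s = Stores { db: HashMap::new(), schedule: HashSet::new() };

    // Crash between INSERT and ZADD: the job is visible in the dashboard
    // (DB) but not scheduled (Redis) -- the safe failure mode.
    s.create_job("job-1", true);
    assert!(s.db.contains_key("job-1"));
    assert!(!s.schedule.contains("job-1"));

    // The dangerous inverse -- scheduled in Redis but missing from the
    // DB -- cannot occur, because the Redis write never precedes the DB write.
    s.create_job("job-2", false);
    assert!(s.db.contains_key("job-2") && s.schedule.contains("job-2"));
    println!("db-first invariant holds");
}
```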
---
Why Not: Alternatives We Rejected
We evaluated four alternatives before settling on Redis sorted sets:
In-memory timer wheel (tokio sleep_until with a priority queue): fast, but volatile. A server restart loses all scheduling state. Rebuilding from the database on startup means scanning every active job -- a multi-second delay during which due jobs are not executing.
PostgreSQL LISTEN/NOTIFY: per-connection, not durable. If the scheduler disconnects (network blip, pool rotation), it misses notifications. You need a fallback polling mechanism anyway, which makes the notification layer redundant.
Message queue (RabbitMQ, SQS, Kafka): adds infrastructure complexity (broker management, dead letter queues, consumer groups) for durability guarantees we do not need. A missed tick is recovered on the next poll one second later.
Distributed scheduler (Dkron, Celery Beat): building 0cron on top of another scheduling engine means inheriting its complexity and operational requirements. Dkron requires Raft consensus and a cluster. Our scheduler is 199 lines. We understand every line.
---
Known Trade-offs
No system is perfect on the first iteration. Four items are on the v2 list:
1. ZRANGEBYSCORE + ZREM should be atomic. The distributed lock handles the race condition correctly, but ZPOPMIN or a Lua script would be cleaner.
2. The lock timeout is hardcoded at 60 seconds. It should be at least as long as the job's configured timeout plus a safety margin.
3. Redis and PostgreSQL operations are not transactional. A saga pattern or outbox table would provide stronger consistency.
4. The cron crate expects expressions with a seconds field (six or seven fields), while standard cron uses five. The NLP parser generates 5-field expressions and prepends a 0 seconds field, which works but creates an impedance mismatch.
These are deliberate trade-offs at 0cron's current scale. Each has a clear upgrade path when the user count warrants it.
---
199 Lines
The entire scheduler -- polling, locking, dispatching, re-scheduling, plus the schedule_job and unschedule_job helper functions -- fits in 199 lines of Rust. The executor adds another 309 lines. Combined, 508 lines of code form the core of a cron job service that handles scheduling, execution, retries, logging, and notifications.
This is the benefit of choosing the right primitive. Redis sorted sets do the heavy lifting of maintaining a time-ordered queue with O(log N) operations. The cron crate handles expression parsing and next-run computation. reqwest handles HTTP. sqlx handles database access. Tokio handles async concurrency.
The application code -- our 508 lines -- is the glue. It connects these primitives into a coherent execution pipeline. There is no framework, no ORM, no dependency injection container, no event bus. Just structs, functions, and the Rust type system ensuring they fit together correctly.
Sometimes the best architecture is the simplest one that works.
---
This is Part 3 of a three-part series on building 0cron.dev. If you missed them: Part 1: Why the World Needs a $2 Cron Job Service covers the market analysis and pricing philosophy. Part 2: 4 Agents, 1 Product covers the parallel build methodology that produced the entire codebase in a single session.