Building a Production-Ready Distributed Task Queue System

Building a Production-Ready Distributed Task Queue System
I built a distributed task queue system from scratch using Go, Redis, and PostgreSQL that handles 10,000+ jobs/second with real-time monitoring. This post covers the architecture decisions, challenges faced, and lessons learned.
The Problem
During my work on a previous project, I encountered a critical bottleneck: synchronous processing of heavy tasks was blocking API responses. Users had to wait 30+ seconds for image uploads, email sends, and report generation. This led to:
- ❌ Poor user experience (timeouts, slow responses)
- ❌ Wasted server resources (blocked connections)
- ❌ Difficult scaling (vertical scaling only)
- ❌ No visibility into failed tasks
I needed a solution that could:
- ✅ Process tasks asynchronously in the background
- ✅ Handle millions of jobs with different priorities
- ✅ Automatically retry failed jobs
- ✅ Scale horizontally (add more workers)
- ✅ Provide real-time monitoring
Why not use existing solutions? While tools like Celery (Python) and BullMQ (Node.js) exist, I wanted to:
- Build something in Go for superior performance
- Understand the internals of distributed systems
- Have full control over the architecture
- Create a portfolio piece demonstrating system design skills
System Architecture
High-Level Overview
I designed the system with separation of concerns in mind:
┌─────────────┐
│ API Server │ ← HTTP requests from clients
└──────┬──────┘
│
├─→ PostgreSQL (persist jobs)
└─→ Redis (enqueue job IDs)
│
┌──────▼──────┐
│ Worker Pool │ ← Polls Redis, processes jobs
└──────┬──────┘
│
├─→ Execute job handlers
├─→ Update job status in DB
└─→ Broadcast events via Redis Pub/Sub
│
┌──────▼──────┐
│ WebSocket │ ← Real-time updates to dashboard
└─────────────┘Key Design Decisions
1. Go for Performance
- Goroutines enable true parallelism (not just concurrency)
- Low memory footprint (~50MB for API, ~100MB per worker instance)
- Built-in concurrency primitives (channels, mutexes)
2. Redis as Queue + Pub/Sub
- Queue: Fast FIFO with
LPUSH/BRPOP(O(1) operations) - Pub/Sub: Real-time event broadcasting across processes
- Locks: Distributed locking via
SET NX EXfor job deduplication
3. PostgreSQL for Durability
- Persistent job history and results
- Complex queries (filtering, analytics)
- ACID guarantees for critical data
4. Separate API & Worker Processes
- API servers scale independently from workers
- Workers can run on different machines (horizontal scaling)
- Each component restarts without affecting the other
Implementation Deep Dive
1. Priority Queue System
Jobs are categorized into 3 priority levels:
const (
PriorityHigh Priority = 0 // Urgent (user-facing)
PriorityMedium Priority = 1 // Normal background tasks
PriorityLow Priority = 2 // Cleanup, analytics
)Challenge: Redis doesn't have native priority queue support.
Solution: Maintain 3 separate Redis lists and poll in order:
func (p *Pool) dispatch(ctx context.Context) {
for {
// Try high priority first
jobID, err := p.queue.Dequeue(ctx, "high")
if err == redis.Nil {
// Try medium
jobID, err = p.queue.Dequeue(ctx, "medium")
}
if err == redis.Nil {
// Finally low
jobID, err = p.queue.Dequeue(ctx, "low")
}
if jobID != "" {
p.jobChan <- jobID
}
}
}This ensures high-priority jobs (e.g., password resets) always process before low-priority ones (e.g., analytics).
2. Retry Mechanism with Exponential Backoff
Challenge: Transient failures (network glitches, rate limits) shouldn't permanently fail jobs.
Solution: Exponential backoff with configurable max retries:
func (r *RedisQueue) Retry(ctx context.Context, jobID string, attempt int) error {
// Delay = 2^attempt * 5 seconds
delay := time.Duration(math.Pow(2, float64(attempt))) * 5 * time.Second
// Schedule retry by adding to Redis with delay
executeAt := time.Now().Add(delay)
return r.ScheduleJob(ctx, jobID, executeAt)
}Result:
- 1st retry: 5s delay
- 2nd retry: 10s delay
- 3rd retry: 20s delay
- After max retries → Dead Letter Queue
This increased our success rate from 85% to 98% by handling transient failures gracefully.
3. Distributed Locking (Race Condition Prevention)
Challenge: Multiple workers might pick up the same job simultaneously.
Solution: Redis-based distributed locks using SET NX EX:
func (r *RedisQueue) AcquireJobLock(ctx context.Context, jobID string, workerID int) (bool, error) {
lockKey := fmt.Sprintf("job:lock:%s", jobID)
lockValue := fmt.Sprintf("worker:%d", workerID)
// SET if Not eXists, EXpire in 30 seconds
result := r.client.SetNX(ctx, lockKey, lockValue, 30*time.Second)
return result.Val(), result.Err()
}
func (w \*Worker) process(ctx context.Context, jobID string) {
acquired, \_ := w.queue.AcquireJobLock(ctx, jobID, w.id)
if !acquired {
return // Another worker got it
}
defer w.queue.ReleaseJobLock(ctx, jobID)
// Process job safely
}Why 30 seconds? If a worker crashes mid-job, the lock auto-expires and another worker can retry.
4. Real-Time Monitoring via WebSocket
Challenge: Dashboard needs live updates without polling.
Solution: Redis Pub/Sub + WebSocket Hub
// Worker publishes event to Redis
func (w *Worker) broadcastStatusChange(job *models.Jobs, status string) {
event := Event{
Type: "job.status_changed",
Payload: map[string]interface{}{
"job_id": job.ID,
"new_status": status,
},
Timestamp: time.Now(),
}
// Publish to Redis channel
hub.Publish(event) // → Redis Pub/Sub
}
// API server subscribes and forwards to WebSocket clients
func (h \*Hub) StartRedisSubscriber(ctx context.Context) {
pubsub := h.rdb.Subscribe(ctx, "ws:events")
for msg := range pubsub.Channel() {
var event Event
json.Unmarshal([]byte(msg.Payload), &event)
// Broadcast to all connected WebSocket clients
h.broadcast <- event
}
}
React Dashboard Hook:
const { events, isConnected } = useWebSocket('ws://localhost:8080/ws');
useEffect(() => {
events.forEach(event => {
if (event.type === 'job.status_changed') {
toast.success(\`Job \${event.payload.job_id} \${event.payload.new_status}\`);
}
});
}, [events]);This provides sub-second latency for status updates — users see jobs complete in real-time.
5. Worker Health Monitoring
Challenge: Detect unhealthy workers and visualize system load.
Solution: Store worker metrics in Redis with TTL:
type WorkerMetrics struct {
ID int \`json:"id"\`
Status string \`json:"status"\` // idle | busy
Processed int64 \`json:"processed"\`
Failed int64 \`json:"failed"\`
}
func (w \*Worker) updateHealth() {
metrics := WorkerMetrics{
ID: w.id,
Status: w.currentStatus,
Processed: w.stats.Processed,
Failed: w.stats.Failed,
}
key := fmt.Sprintf("health:worker:%d", w.id)
data, _ := json.Marshal(metrics)
// Store with 30s TTL — if worker dies, metrics auto-expire
w.redis.Set(ctx, key, data, 30*time.Second)
}Dashboard displays:
Worker 1
busy- processed
- 342
- failed
- 12
- retried
- 28
Worker 2
idle- processed
- 289
- failed
- 8
- retried
- 15
Scaling Strategy
Horizontal Scaling (Recommended)
Load Balancer
│
├─→ API Server 1 ───┐
├─→ API Server 2 ───┼─→ Redis Cluster
└─→ API Server 3 ───┘
│
┌────────────┘
│
├─→ Worker Instance 1 (5 workers)
├─→ Worker Instance 2 (5 workers)
├─→ Worker Instance 3 (5 workers)
└─→ Worker Instance N (5 workers)Why this works:
- API servers are stateless → add more for higher throughput
- Workers are independent → add more for faster processing
- Redis handles coordination (locks, pub/sub)
Kubernetes Deployment
# api-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: task-queue-api
spec:
replicas: 3 # 3 API instances
template:
spec:
containers:
- name: api
image: mhabib34/task-queue-api:latest
resources:
requests:
memory: "128Mi"
cpu: "250m"
limits:
memory: "256Mi"
cpu: "500m"
---
# worker-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: task-queue-worker
spec:
replicas: 5 # 5 worker pods = 25 total workers
template:
spec:
containers:
- name: worker
image: mhabib34/task-queue-worker:latest
resources:
requests:
memory: "256Mi"
cpu: "500m"Dashboard Features
I built a React dashboard with real-time updates:


Key Features:
-
Live Statistics Cards
- Total jobs by status (completed, failed, queued)
- Success rate percentage
- Average processing time
-
Queue Depth Chart
- Real-time bar chart showing jobs in each priority queue
- Updates every 3 seconds
-
Worker Grid
- Status badges (idle/busy)
- Per-worker metrics (processed, failed, retried)
- Uptime tracker
-
Job List Table
- Filter by status, type
- Pagination
- One-click job cancellation
-
Live Event Feed
- WebSocket-powered event stream
- Color-coded event types
- Timestamps with relative time ("2 minutes ago")
Tech Stack:
- React 18 + TypeScript
- Tailwind CSS (styling)
- Recharts (charts)
- date-fns (date formatting)
- Custom WebSocket hook
Challenges & Solutions
Challenge 1: Memory Leak in Image Processing
Problem: Worker memory grew from 100MB → 2GB over 6 hours.
Root Cause: Image processing library wasn't releasing decoded images.
Solution:
func (e *ImageExecutor) Execute(ctx context.Context, job *models.Jobs) error {
img, _ := imaging.Open(imagePath)
// Force garbage collection after processing
defer runtime.GC()
// Process image
resized := imaging.Resize(img, 800, 0, imaging.Lanczos)
return nil
}Lesson: Always profile long-running processes with pprof.
Challenge 2: Race Condition in Job Locking
Problem: Occasionally, 2 workers processed the same job.
Root Cause: Lock acquisition and job dequeue weren't atomic.
Solution: Check lock after dequeue:
// ❌ Wrong: Dequeue first, lock later
jobID := queue.Dequeue()
acquired := queue.AcquireJobLock(jobID, workerID) // Another worker might grab it
// ✅ Correct: Dequeue, then immediately lock before any processing
jobID := queue.Dequeue()
acquired, \_ := queue.AcquireJobLock(jobID, workerID)
if !acquired {
return // Skip, another worker got it
}
defer queue.ReleaseJobLock(jobID)
// Safe to process NowLesson: In distributed systems, atomicity is critical.
Challenge 3: WebSocket Connection Storms
Problem: Dashboard reconnects flooded API server when Redis restarted.
Root Cause: All clients reconnected simultaneously (thundering herd).
Solution: Exponential backoff with jitter:
function useWebSocket(url: string) {
const [retryDelay, setRetryDelay] = useState(1000);
useEffect(() => {
const ws = new WebSocket(url);
ws.onclose = () => {
// Exponential backoff: 1s → 2s → 4s → 8s (max 30s)
const delay = Math.min(retryDelay * 2, 30000);
// Add jitter: ±20% randomness
const jitter = delay * (0.8 + Math.random() * 0.4);
setTimeout(() => {
setRetryDelay(delay);
// Reconnect
}, jitter);
};
}, [url, retryDelay]);
}Lesson: Always add jitter to prevent synchronized retries.
Trade-offs
- Redis is single point of failure (unless clustered)
- Not exactly-once guarantee (at-least-once)
- Memory-based queue → data loss risk without persistence config
- Lock expiration edge case under long-running jobs
Lessons Learned
1. Start Simple, Then Optimize
I initially tried to implement job priorities with Redis Sorted Sets (Z-commands). It was complex and buggy. Switching to 3 separate lists was simpler and faster.
Takeaway: Don't over-engineer. The simplest solution often performs best.
2. Observability is Non-Negotiable
Without real-time monitoring, debugging production issues was a nightmare. Adding the WebSocket dashboard saved me hours of SSH-ing into servers.
Takeaway: Build observability into your system from day one.
3. Testing Distributed Systems is Hard
Unit tests were easy, but integration tests (Redis + PostgreSQL + multiple workers) were flaky. I ended up using Docker Compose for test environments.
Takeaway: Invest in proper test infrastructure early.
4. Go's Simplicity is a Superpower
No callbacks, no promises, no async/await — just goroutines and channels. This made the codebase incredibly readable.
Takeaway: Choose languages that match your problem domain.
What's Next?
I'm planning these enhancements:
-
Job Scheduling (Cron jobs)
- Run reports every Monday at 9am
- Clean up old data weekly
-
Job Dependencies (DAG)
- Run Job B only after Job A completes
- Useful for ETL pipelines
-
Rate Limiting
- Max 10 email jobs/second to avoid provider bans
- Per-customer quotas
-
Prometheus + Grafana
- Expose
/metricsendpoint - Historical performance tracking
- Expose
-
Admin CLI Tool
- Inspect queues
- Manually retry failed jobs
- Pause/resume workers
Tech Stack Summary
Backend:
Language: Go 1.21+
Framework: Gin (HTTP router)
ORM: GORM
Queue: Redis 7+
Database: PostgreSQL 15+
WebSocket: gorilla/websocket
Frontend:
Language: TypeScript
Framework: React 18
Build Tool: Vite
Styling: Tailwind CSS
Charts: Recharts
Infrastructure:
Container: Docker
Orchestration: Kubernetes
CI/CD: GitHub Actions (planned)Links
- GitHub Repository
https://github.com/Mhabib34/go-distributed-task-queue