The application: Marketing automation at scale
Our platform allows clients to create marketing campaigns, schedule messages, and track their performance. The goal is to deliver as many messages as possible in the shortest time, without breaking the system.
From the start, we knew we needed background processing to handle large-scale message dispatching. We built our application on Ruby on Rails, using ActiveJob with DelayedJob for background tasks. Our database of choice was PostgreSQL, which managed job scheduling and analytics tracking.
Initially, everything ran smoothly. But when we tried to scale up message delivery, problems started to pile up.
Scaling gone wrong: The challenges we faced
Problem #1: More workers ≠ faster processing
We started with 10 background workers handling message dispatching, and they processed around 1,000 messages without trouble. We assumed that increasing the count to 50 workers would speed things up.
Expected result: Faster message delivery.
Actual result: The application became slower and unresponsive.
The root cause: database connection limits
Each worker needed a separate database connection to fetch and process jobs. However, our PostgreSQL database instance had a limit of 30 concurrent connections. With 50 workers running, many were left waiting for database access, creating a bottleneck instead of improving performance.
The fix: scale the database, not just the app
We upgraded to a more powerful PostgreSQL instance, optimized connection pooling, and reworked our database queries to be more efficient. This helped eliminate connection-based slowdowns.
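As a rough illustration of the diagnosis: every worker (and every thread it runs) checks out its own connection, so the pool size in `config/database.yml` multiplied by the number of processes has to stay under the database's `max_connections`. A quick way to see whether workers are starving for connections is the pool's own stats; a minimal sketch:

```ruby
# If "waiting" is consistently non-zero, callers are queuing for a
# database connection and the pool (or max_connections) is too small.
stats = ActiveRecord::Base.connection_pool.stat
Rails.logger.info(
  "DB pool: size=#{stats[:size]} busy=#{stats[:busy]} " \
  "idle=#{stats[:idle]} waiting=#{stats[:waiting]}"
)
```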
Problem #2: Too many database queries
Once we upgraded the database, we pushed the system further: could it handle 10,000 messages? Yes, but we hit another roadblock: slow job execution times caused by excessive database queries.
The root cause: inefficient job execution
Each background job was making multiple database queries: checking job status, fetching recipient data, and logging progress. This added unnecessary load and slowed everything down.
The fix: optimize data handling
Instead of making multiple queries per job, we redesigned the job processing:
- Pre-load required data into memory, reducing the number of database reads.
- Batch database writes, so each worker updated the database only once per cycle.
This dramatically reduced query load, improving performance significantly.
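Here is a minimal sketch of that shape; `Recipient`, `DeliveryLog`, and `deliver_message` are hypothetical stand-ins for our real models and delivery client:

```ruby
class DispatchBatchJob < ApplicationJob
  queue_as :default

  def perform(campaign_id, recipient_ids)
    # One query up front instead of a SELECT per message.
    recipients = Recipient.where(id: recipient_ids).to_a

    delivery_rows = recipients.map do |recipient|
      deliver_message(recipient)
      { campaign_id: campaign_id, recipient_id: recipient.id, delivered_at: Time.current }
    end

    # One batched INSERT per cycle instead of a write per message.
    DeliveryLog.insert_all(delivery_rows) if delivery_rows.any?
  end

  private

  def deliver_message(recipient)
    # Hand off to the email/SMS provider client here.
  end
end
```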
Problem #3: The 100,000 message challenge
After optimizing job execution, we could comfortably process 10,000 messages. But what about 100,000?
Here’s where things really broke down: memory leaks in DelayedJob caused 30% of jobs to crash mid-execution, leaving thousands of messages undelivered.
The root cause: memory management in background jobs
DelayedJob wasn't designed to handle extremely large workloads efficiently. Each job was holding too much in-memory state, causing memory bloat and eventual crashes.
The fix: modular job processing
Instead of one job per message, we introduced a three-tier job system:
- Tier 1: A main job that kicks off batch processing.
- Tier 2: Jobs that each handle a chunk of 1,000 messages.
- Tier 3: Each of those chunk jobs creates smaller individual message jobs.
By breaking large jobs into smaller, manageable ones, we eliminated memory leaks and ensured that no single worker was overloaded.
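In ActiveJob terms, the structure looked roughly like the sketch below; the class names, chunk size, and `recipient_ids` accessor are illustrative rather than our exact code:

```ruby
class CampaignDispatchJob < ApplicationJob          # Tier 1
  def perform(campaign_id)
    Campaign.find(campaign_id).recipient_ids.each_slice(1_000) do |chunk|
      ChunkDispatchJob.perform_later(campaign_id, chunk)
    end
  end
end

class ChunkDispatchJob < ApplicationJob             # Tier 2
  def perform(campaign_id, recipient_ids)
    recipient_ids.each do |recipient_id|
      MessageDeliveryJob.perform_later(campaign_id, recipient_id)
    end
  end
end

class MessageDeliveryJob < ApplicationJob           # Tier 3
  def perform(campaign_id, recipient_id)
    # Deliver a single message; each job holds only one recipient's
    # data in memory, so nothing accumulates across the run.
  end
end
```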
How to scale without breaking everything
If you’re working on an application that needs to scale, here are some key takeaways from our experience:
1. Start small, scale smart
Instead of over-engineering from day one, build a solid foundation and scale gradually. Test how your system behaves under increasing loads before things break in production.
2. Use the right tools for background jobs
Not all background job processors are built for scale. If you're working with high-volume data, consider switching from DelayedJob to Sidekiq (or a more efficient alternative) before hitting performance issues.
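If you are already routing jobs through ActiveJob, the backend swap is mostly configuration (Sidekiq additionally needs Redis and its own worker processes). A rough sketch, with `MyApp` as a placeholder application name:

```ruby
# config/application.rb (sketch)
module MyApp
  class Application < Rails::Application
    # Point ActiveJob at Sidekiq instead of DelayedJob.
    # Sidekiq needs Redis and a worker process: `bundle exec sidekiq`.
    config.active_job.queue_adapter = :sidekiq
  end
end
```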
3. Optimize database usage
- Reduce unnecessary queries: batch your writes and cache reads.
- Use connection pooling to prevent bottlenecks.
- Optimize indexes and query execution plans (see the sketch after this list).
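On the indexing point, the table and column names below are hypothetical, but the pattern is the useful part: index the columns your hottest query filters and sorts on, then confirm the planner actually uses it.

```ruby
# Hypothetical migration covering a dispatch query that filters on
# status and orders by scheduled_at.
class AddDispatchIndexToMessages < ActiveRecord::Migration[7.0]
  def change
    add_index :messages, [:status, :scheduled_at]
  end
end

# From a Rails console, check the query plan:
puts Message.where(status: "pending").order(:scheduled_at).limit(100).explain
```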
4. Monitor performance in real time
- Implement application monitoring tools like New Relic, Datadog, or Prometheus.
- Track database query performance and adjust as needed (a minimal sketch follows this list).
- Log everything, but keep logs efficient to avoid excessive I/O overhead.
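Even without a full APM, Rails' built-in instrumentation can flag slow queries. A minimal sketch of an initializer, with an arbitrary 100 ms threshold:

```ruby
# config/initializers/slow_query_logger.rb (sketch)
ActiveSupport::Notifications.subscribe("sql.active_record") do |_name, started, finished, _id, payload|
  duration_ms = (finished - started) * 1000
  if duration_ms > 100 && payload[:name] != "SCHEMA"
    Rails.logger.warn("[slow query] #{duration_ms.round(1)} ms: #{payload[:sql]}")
  end
end
```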
5. Break large jobs into smaller chunks
Handling 100,000+ jobs? Split them into smaller, modular tasks to prevent memory bloat and processing slowdowns.
Final thoughts
Scaling an application isn’t just about adding more servers or workers; it requires careful optimization of every part of the system. By learning from our mistakes, you can avoid costly downtime and ensure your app scales smoothly.
Remember, scaling is an ongoing process that demands continuous monitoring and adjustment. Embrace the challenges and opportunities that come with growth, and always be ready to adapt your strategies to meet new demands.