When building a software product, I’ve always found it indispensable to have reliable escape lanes that can be taken during growing pains to buy some time.
One of those reliable mechanisms, offering workarounds and alternatives, is background workers and tasks. Whether you’re processing user-generated data, sending out notifications, or interacting with third-party APIs, background tasks play a crucial role in ensuring your application runs smoothly without overloading the user-facing processes.
I was very excited to see that Django will be bringing a native interface for this with a default implementation. (Sidenote: Celery rocks and I sincerely hope that as a project they can use the new Django capabilities).
So I wanted to discuss some guidelines and the tradeoffs involved when implementing tasks for background workers. This is strongly based on Python, but it certainly ports to other technologies.
Taming the beast
The biggest change when introducing background workers is that you just transformed your not-so-distributed system into a distributed system, and complexity rises significantly. Therefore, it’s particularly important to mitigate potential drawbacks via strong reliability and efficient operation.
Most task libraries or platforms provide a set of tools and primitives to address these issues uniformly and in a standard fashion. You should check those out before writing your own code.
Reliability
When implementing background tasks, reliability is key. Here are some best practices to follow.
Ensure that your tasks are idempotent
That is, ensure that these can be run multiple times without causing unintended side effects. This is crucial for tasks that might need to be retried after failures.
Some things are somewhat idempotent by nature (sending an email reminder), since there is no (big) harm in occasionally sending it twice. However, submitting a return to a payment processor twice might take your business down.
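As a minimal sketch of what this can look like with a Celery task: the `Refund` model, its `external_id` field, and the `payment_gateway` client below are hypothetical stand-ins for your own code.

```python
from celery import shared_task

from myapp.models import Refund        # hypothetical Django model
from myapp import payment_gateway      # hypothetical gateway client


@shared_task
def submit_refund(refund_id):
    refund = Refund.objects.get(pk=refund_id)

    # If a previous run already submitted this refund, running again is harmless.
    if refund.external_id:
        return refund.external_id

    external_id = payment_gateway.submit_refund(
        amount=refund.amount,
        idempotency_key=str(refund.pk),   # lets the gateway deduplicate on its side
    )
    # Persist the result so future runs (and retries) short-circuit above.
    refund.external_id = external_id
    refund.save(update_fields=["external_id"])
    return external_id
```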
Error Handling & Retries
Background tasks often involve external systems or complex processes that can fail for reasons beyond our control. Be defensive, and implement robust error handling that is semantically clear enough to act on when needed. Ideally, set up retry mechanisms to handle transient issues without manual intervention.
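With Celery, for instance, much of this can be declared on the task itself. In this sketch the CRM endpoint and task body are placeholders for whatever external call you are protecting.

```python
import requests
from celery import shared_task


@shared_task(
    autoretry_for=(requests.RequestException,),  # retry only transient network errors
    retry_backoff=True,       # exponential backoff between attempts
    retry_backoff_max=600,    # never wait more than 10 minutes
    retry_jitter=True,        # randomize delays to avoid thundering herds
    max_retries=5,            # then give up and surface the failure
)
def push_to_crm(contact_id, payload):
    response = requests.post(
        f"https://crm.example.com/contacts/{contact_id}",  # placeholder endpoint
        json=payload,
        timeout=10,
    )
    response.raise_for_status()
```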
Watch resource usage
Background tasks can consume significant system resources under stressed conditions. Have some form of rate limiting and queue management in place to prevent resource exhaustion. Ideally, some form of circuit breaker is always nice to have baked in.
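Here is a sketch using Celery’s built-in per-worker rate limit, plus a circuit breaker around the external call. The pybreaker library is used here purely as one option, and the push endpoint is a placeholder.

```python
import pybreaker
import requests
from celery import shared_task

# Open the circuit after 5 consecutive failures; try again after 60 seconds.
push_breaker = pybreaker.CircuitBreaker(fail_max=5, reset_timeout=60)


@shared_task(rate_limit="30/m")   # at most 30 executions per minute, per worker
def send_push_notification(device_token, message):
    # While the circuit is open, calls fail fast instead of piling up
    # against an already struggling provider.
    push_breaker.call(
        requests.post,
        "https://push.example.com/send",   # placeholder endpoint
        json={"token": device_token, "message": message},
        timeout=5,
    )
```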
Task Prioritization
Not all tasks are created equal. Prioritize critical tasks, such as payment processing or real-time notifications, to ensure they are handled promptly. In my experience, the best way to do this is by segmenting workers by queue. You’ll be able to scale workers for specific queues and avoid one queue draining resources from the others.
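With Celery, that segmentation is mostly configuration; the module and queue names below are illustrative.

```python
from celery import Celery

app = Celery("myproject")

# Route critical work to its own queue so it never waits behind bulk jobs.
app.conf.task_routes = {
    "payments.tasks.*": {"queue": "critical"},
    "notifications.tasks.*": {"queue": "default"},
    "reports.tasks.*": {"queue": "bulk"},
}

# Each queue then gets its own worker pool, scaled independently, e.g.:
#   celery -A myproject worker -Q critical --concurrency=8
#   celery -A myproject worker -Q bulk --concurrency=2
```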
Appropriate Division of Labor
Sometimes an initial design for a task seems reasonable, but as the system grows it becomes a huge task that takes forever to run in a single worker. There is a sweet spot between having fewer tasks (less complexity) and better balancing (more reliability).
Be thoughtful about identifying data keys you can aggregate or partition on, and how those might allow for a better distribution of labor among workers.
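As a sketch, a dispatcher task can do a cheap aggregation over such a key and fan out small per-key tasks instead of one enormous loop. The `Invoice` model and `customer_id` field here are hypothetical.

```python
from celery import shared_task

from myapp.models import Invoice   # hypothetical Django model


@shared_task
def generate_all_statements():
    # Dispatcher: aggregate over the partition key (customer here),
    # then enqueue one small task per key.
    customer_ids = Invoice.objects.values_list("customer_id", flat=True).distinct()
    for customer_id in customer_ids:
        generate_customer_statement.delay(customer_id)


@shared_task
def generate_customer_statement(customer_id):
    # A small, retryable unit of work that any worker can pick up.
    for invoice in Invoice.objects.filter(customer_id=customer_id):
        ...  # build and store the statement for this invoice
```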
Operation
To ensure your system remains stable and responsive, consider these operational best practices.
Monitoring
Use tools like Celery Flower to monitor the status of your background tasks. Keep an eye on task completion times, failures, and retries. Send Slack or incident notifications if queue lengths start to look fishy.
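One possible approach is a small periodic task that checks queue depths and alerts when they cross a threshold. This sketch assumes a Redis broker (where each Celery queue is a Redis list); the thresholds, queue names, broker URL, and Slack webhook URL are placeholders.

```python
import redis
import requests
from celery import shared_task

QUEUE_THRESHOLDS = {"critical": 100, "default": 1_000, "bulk": 10_000}
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder


@shared_task
def check_queue_depths():
    conn = redis.Redis.from_url("redis://localhost:6379/0")  # placeholder broker URL
    for queue, threshold in QUEUE_THRESHOLDS.items():
        depth = conn.llen(queue)   # with the Redis broker, each queue is a list
        if depth > threshold:
            requests.post(
                SLACK_WEBHOOK_URL,
                json={"text": f"Queue '{queue}' has {depth} pending tasks"},
                timeout=5,
            )
```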
Logging & Auditing
Use a log aggregator and keep detailed logs of task executions at a configurable level, including timestamps, results, and errors. You might want to log everything when responding to an incident, but that can bring your logging system to its knees if you can’t disable the behavior via configuration.
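One simple way to get that switch is to drive task log verbosity from an environment variable. The variable name, logger name, and helper below are illustrative.

```python
import logging
import os
import time

# Crank this up to DEBUG during an incident, back to INFO afterwards,
# without redeploying.
TASK_LOG_LEVEL = os.environ.get("TASK_LOG_LEVEL", "INFO")
logging.basicConfig(level=TASK_LOG_LEVEL)

logger = logging.getLogger("myproject.tasks")   # illustrative logger name


def run_with_audit_log(task_name, func, *args, **kwargs):
    started = time.monotonic()
    try:
        result = func(*args, **kwargs)
    except Exception:
        logger.exception("task=%s failed", task_name)
        raise
    logger.info("task=%s duration=%.2fs", task_name, time.monotonic() - started)
    # The full result payload only shows up when TASK_LOG_LEVEL=DEBUG.
    logger.debug("task=%s result=%r", task_name, result)
    return result
```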
Handling Scalability
As your application grows, so will the demand on your background task system. The quick trigger for handling spikes should be adding workers to strained queues. But there will come a time to think about optimizing task performance and restructuring workloads.
Prepare for disaster
The best piece of advice is that you should not be afraid to run a disaster recovery, and that you should be able to. It is not a tool you will use often, but having it in place ensures that everything else is in working condition (tasks are idempotent, workers are highly scalable, queues and priorities are efficient).
Prepare for the unexpected by implementing backup and recovery strategies. Ensure that critical tasks can be resumed or retried after a failure, and consider using a distributed queue system to minimize downtime. I strongly suggest RabbitMQ, particularly when running on Kubernetes, as there is a fantastic operator for it.
