At Deputy, it is our vision to build thriving workplaces in every community. We believe that trust and transparency underpin a thriving workplace, which is why we are sharing the details of a significant outage that occurred in January 2021. At Deputy, we are fully committed to delivering high standards of service, which includes keeping you up to date on the continuous investments we are making in our platform.
On the 25th of January, a system failure caused a full platform outage for Deputy customers in Australia between 8:39 am and 2:30 pm (approximately 6 hours).
This quickly led to an investment in the underlying infrastructure of the Deputy platform to mitigate the risk of future outages.
Below, we detail our journey from outage to investment, and how we are working towards continuous platform stability as we onboard more and more customers globally. A disclaimer: the following content may get technical!
The events of Monday the 25th January
January 25th is a unique and busy workday in Australia, especially for our customers. It’s the day before a public holiday, it’s the end of the month, and this year, it fell on a Monday, when many of our customers export timesheets and run payroll simultaneously.
At 8:39 am, our automated monitoring raised the alert "Heavy Response Times". At the same time, our customer support team started receiving hundreds of customer chats reporting trouble accessing Deputy. We declared an incident at this time and updated our status page.
Investigations began. Our software engineering team hadn’t released anything new that day, so no new code or infrastructure changes were present. Our code continued to pass all of our automated quality tests. Traffic to the login page had grown naturally through the morning, as expected.
Meanwhile, symptoms were surfacing. Our elastic scaling kept adding more web servers to try to cope with the increasing load. Digging one level deeper, our databases were seeing 10x-20x their usual load. Continuing to dig, we found that Redis, our in-memory cache, which is normally used to drive high performance, was seeing abnormally high utilisation. It was at this point that we confirmed that Redis was the single point of failure: our scalable databases and elastic web servers were all waiting on our one Redis instance, resulting in a cascading failure.
By midday, we had provisioned a new Redis instance and effectively restarted all of our processes and systems, and by 2:30 pm, Deputy was again accessible to our customers. However, the Redis risk remained, lurking - provisioning a new instance was only a temporary patch. In fact, the problem recurred in a smaller, more controlled way on a few other occasions over the next few weeks.
So what happened with Redis?
Deputy was using a non-scaling implementation of Redis across an entire region (i.e. Australia) as a caching solution. As our customer base grew, workloads became heavy and concentrated on that one instance, turning it into a single point of failure.
We had outgrown our existing Redis architecture, and it was quickly made apparent to us that it was time to implement a more scalable solution. To make an analogy, we had 1 cash register for a very, very busy supermarket. Even as the supermarket got to peak capacity, we still had 1 cash register. In the new architecture, we have unlimited cash registers!
Our team went to work, consulting with our AWS enterprise architecture team and working nights and weekends to develop a scalable, distributed Redis.
In this new architecture, our infrastructure now has 10x the Redis clusters to effectively spread and orchestrate workloads, and we continue to add new clusters as our customer base grows. In short, our infrastructure now reflects the requirements of today’s customers and future-proofs us for our growth ambitions.
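This post does not describe how workloads are assigned to the new clusters. As a purely illustrative sketch, one common way to spread load is to hash a customer identifier and use the result to pick a cluster, so each customer's cache traffic always lands on the same one. The cluster names, the count of 10, and the function names below are hypothetical, and no Redis client is involved; the sketch only shows the routing decision.

```python
import hashlib
from collections import Counter

# Hypothetical cluster names; the real topology is not described in this post.
CLUSTERS = [f"redis-cluster-{i}" for i in range(10)]


def cluster_for(customer_id: str, clusters=CLUSTERS) -> str:
    """Map a customer to one cluster using a stable hash of its ID."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then wrap it
    # onto the number of available clusters.
    index = int.from_bytes(digest[:8], "big") % len(clusters)
    return clusters[index]


def load_per_cluster(customer_ids, clusters=CLUSTERS) -> Counter:
    """Count how many customers land on each cluster."""
    return Counter(cluster_for(c, clusters) for c in customer_ids)


if __name__ == "__main__":
    customers = [f"customer-{n}" for n in range(10_000)]
    for name, count in sorted(load_per_cluster(customers).items()):
        print(name, count)
```

One caveat with this simple modulo scheme: changing the number of clusters remaps most customers to a different cluster. Systems that add clusters frequently often use consistent hashing or an explicit lookup table instead.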
29th March, The False Start
On Monday the 29th of March, we released the new scalable, distributed Redis to all customers, with the intent of resolving these issues once and for all. However, as irony would have it, the release inadvertently led to another outage that same day, caused by how the new system had been tuned. The issue was quickly resolved.
19th April, All Systems Go: Ready for 10x!
The outage on the 29th of March was a growing pain, a speed bump on the way to the full working solution that is now in production and handling usage with ease!
The January incident was a key catalyst for our ongoing work to improve system resilience and to systematically remove any single points of failure that may exist as our customer utilisation expands.
What Else Has Happened to Drive Our Uptime and Customer Experience?
- Redis has been reworked and re-architected
- Additional monitoring, alerts, and logs have been introduced in the application
- Circuit breakers have been implemented to reduce the likelihood of cascading failures
- Elastic computing scaling rules have been adjusted to better handle scale up when required
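For readers unfamiliar with circuit breakers, here is a minimal sketch of the general pattern, not Deputy's actual implementation. After a set number of consecutive failures, the breaker "opens" and rejects calls immediately instead of letting them queue up behind an unhealthy dependency; after a cool-down, it lets a trial call through to check whether the dependency has recovered. The class name, thresholds, and timings are all assumptions chosen for illustration, and the sketch is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast rather than wait on an unhealthy dependency.
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In a setup like the one described above, each cache read could go through `breaker.call(...)`, with the caller catching `CircuitOpenError` and falling back to a slower path or a degraded response rather than waiting on the cache.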
Thank you for choosing Deputy
We understand this was an upsetting outage for our customers, especially on a payroll day, before a public holiday. We responded quickly to correct the situation, and have systematically dealt with Redis scalability as the root cause.
Thank you for your patience and understanding. We do not take for granted the trust you have placed in Deputy. We will continue working to make Deputy highly available and your trusted partner, while being open and transparent as we strive for continuous improvement.