163 minutes of observed downtime for ResearchEquals

On July 14th, 2025 at 12:23PM Berlin time, we observed an error on ResearchEquals that caused 163 minutes (just under three hours) of observed downtime. This blog post documents the incident so you know what happened, what went wrong, and what we're improving as a result.

What happened

ResearchEquals responded with a "503 Service Unavailable" error when loaded through https://www.researchequals.com. We noticed the issue when loading the page manually, and it persisted across all subpages. As a result, ResearchEquals was completely down.

In order to properly resolve the issue, we switched the website into maintenance mode at 12:34PM local time and communicated the downtime on Mastodon. By 3:06PM local time, we deployed a fix and normal operations resumed.
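For readers who want to check the numbers, the timeline above can be reproduced with a short calculation. This is only an illustrative sketch: the helper name minutes_between is our own, and the timestamps are treated as naive local (Berlin) times, which is safe here because all three fall on the same day.

```python
from datetime import datetime


def minutes_between(start: datetime, end: datetime) -> int:
    """Return the whole number of minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


# Timestamps from this report, all in Berlin local time on the same day.
observed = datetime(2025, 7, 14, 12, 23)     # error first observed
maintenance = datetime(2025, 7, 14, 12, 34)  # maintenance mode switched on
resolved = datetime(2025, 7, 14, 15, 6)      # fix deployed

observed_downtime = minutes_between(observed, resolved)        # 163 minutes
time_to_maintenance = minutes_between(observed, maintenance)   # 11 minutes
```

The 163 minutes only cover the downtime we observed ourselves; the actual downtime may have been longer.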

What went wrong

First, our uptime monitor did not detect the issue. Our uptime checks are therefore not reliably catching the errors that visitors encounter when loading the site, and we cannot know how long the issue had already been happening. Potentially, the downtime started a week earlier and we simply did not know.
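To illustrate the kind of check that would have caught this, here is a minimal sketch of an uptime probe that requests pages the way a visitor would and treats any 5xx response as downtime. It is not our actual monitoring setup: the function names (fetch_status, is_down, failing_pages) and the idea of passing in a list of page URLs are assumptions made for the example.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List, Optional


def fetch_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx responses; the status code is on the exception.
        return err.code
    except OSError:
        # Covers connection failures and timeouts (URLError is an OSError).
        return None


def is_down(status: Optional[int]) -> bool:
    """Treat a missing response or any 5xx status (such as 503) as downtime."""
    return status is None or 500 <= status <= 599


def failing_pages(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[int]] = fetch_status,
) -> List[str]:
    """Return the URLs that currently count as down."""
    return [url for url in urls if is_down(fetch(url))]
```

Calling failing_pages with the homepage and a few subpages on a schedule, and alerting whenever the returned list is not empty, would flag a site-wide 503 like this one. The fetch parameter is injectable so the logic can be tested without network access.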

Second, the direct cause of the 503 error was an overly broad database query that ran for logged-out visitors. On every page load, the server queried and processed hundreds of draft modules. This overloaded the servers and caused the "503 Service Unavailable" error. Our fix ensures the query only runs when a user is logged in. Nobody gained access to draft modules not associated with their workspace.
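The shape of the fix can be sketched as a guard that skips the query entirely for logged-out visitors. This is a simplified illustration and not the ResearchEquals code: Session, load_draft_modules and query_drafts are hypothetical names, and the real query and data model differ.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Session:
    """Hypothetical session object; user_id is None for logged-out visitors."""
    user_id: Optional[int] = None
    workspace_id: Optional[int] = None


def load_draft_modules(
    session: Session,
    query_drafts: Callable[[int], List[str]],
) -> List[str]:
    """Return the drafts for the session's workspace, or nothing when logged out."""
    if session.user_id is None or session.workspace_id is None:
        # Logged out: never touch the database for drafts.
        return []
    # Logged in: query only the drafts that belong to this workspace.
    return query_drafts(session.workspace_id)
```

With a guard like this, an anonymous page load costs no draft query at all, and a logged-in page load is scoped to a single workspace rather than to every draft.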

What we are improving

As this incident indicates, we need to improve the reliability and specificity of our uptime monitors. We currently include a general metric for the overall service, but that does not help us pinpoint issues in the specific services. In the future, we will improve our client-side monitoring for errors, in order to detect issues earlier on. We are also restructuring how we manage our services in our upgraded version 2, which will address this issue.

What we also observed is that the codebase for ResearchEquals version 1 is getting outdated. This is the result of limited maintenance availability over the past eighteen months. As we stated in our recent blog post, we are ramping up again – and in order to improve the overall ResearchEquals service we will do two things:

Further reduce the complexity of version 1 – this will help reduce the number of places where incidents can occur
Implement improved observability in version 2 – this will improve our capacity to detect issues and respond to them promptly

We cannot commit to implementing improved observability features in version 1 right now, as that would detract from our work to develop version 2. If we had more resources, we would do so, but the reality is that we currently have circa 20 hours per week to do all development of ResearchEquals, and we want to ensure that we can provide version 2 by the new year.

We thank you for being on this journey with us. Building and maintaining production services means these kinds of issues arise. We consider it important to report on them transparently rather than hide them. That we make mistakes is inevitable – what matters is how we deal with them.

We look forward to also sharing good news with you soon 😊