
ATS incident: Public report

Author: CLEVR
Last update: June 16, 2025

Or: how we fixed a production issue by working together and using all available tools

Let me tell you a short story about cooperation, troubleshooting and creative thinking. The date is 17 February 2021, 11:15, and this is actual footage of the dev team behind the Mendix Application Test Suite (ATS) at the time:


You see, ATS had crashed mere minutes ago. The age-old trick of switching it off and on again had failed us. In fact, this was already the third crash in a matter of hours, even though we had restarted the app twice. Whatever was causing the app to crash was not going away. We had to fix it, fast.

This blog post aims to retell the events that transpired that day.

But I am getting ahead of myself; let's go back a bit.
How did we even know that the app had just crashed?

Monitoring

In an ideal world, monitoring would be done by the platform. And indeed, Mendix does detect when an app is down and sends automated alerts. However, its definition of "app is down" is too strict. In our case the web server was still up and happily serving requests, but the Akka system (which, as far as I know, is responsible for microflow execution) had crashed, making the application unusable. Even though the app was no longer usable for end users, there was no alert from Mendix.

Luckily, we do not have to depend on the platform's alerting alone. With the help of ATS we had set up several availability tests, which consist of logging in and going through the main flows in the app. These tests run continuously and are standard for all our apps. It was these availability tests that alerted our team that something had gone terribly, terribly wrong.
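To illustrate the core idea outside of ATS: an availability test periodically exercises something critical in the app and raises an alert when it fails. The sketch below is a minimal, hypothetical probe in plain Java (the URL and the alert mechanism are placeholders, not anything from our setup). A real ATS availability test drives a browser through complete user flows, which is exactly why it caught this incident even though plain page requests were still succeeding.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class AvailabilityProbe {

    // Hypothetical endpoint; a real ATS test logs in and walks through the main flows.
    private static final URI APP_URL = URI.create("https://example-app.mendixcloud.com/login.html");

    public static void main(String[] args) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10))
                .build();
        HttpRequest request = HttpRequest.newBuilder(APP_URL).GET().build();
        try {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() >= 400) {
                alert("App returned HTTP " + response.statusCode());
            }
        } catch (Exception e) {
            alert("Availability check failed: " + e.getMessage());
        }
    }

    private static void alert(String message) {
        // Placeholder: in practice this would notify the team (mail, chat webhook, pager).
        System.err.println("ALERT: " + message);
    }
}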

Investigation

As soon as the alerts came in, an emergency meeting took place. Everyone contributed by brainstorming theories about what was happening, while at the same time drafting a message for all ATS users to let them know that the issue was being looked into.

The next step after detecting the issue was to find the root cause. Here again we had help from our tooling, namely Mendix Application Performance Diagnostics (APD). We needed to identify possible causes for the crash, and APD comes with a perfect feature for that. The trap tool in APD listens to logs at all log levels and records them when an error occurs. This means that when an error occurs, we have all the logging from the period around it, including any trace or debug logging that would normally not be available. Ask anyone who has ever had to debug an issue on production, and they will tell you how invaluable good logging information is.
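APD's internals are not shown in this post, but the underlying idea of a trap, keeping the most recent log lines of every level in memory and only writing them out when an error arrives, can be sketched in a few lines of Java. This is purely an illustration of the concept, not APD code:

import java.util.ArrayDeque;
import java.util.Deque;

// Conceptual sketch of a log "trap": retain the most recent log lines at every level
// and dump them only when an error comes in, so trace/debug context is not lost.
public class LogTrap {

    private static final int CAPACITY = 1000;        // how many recent lines to keep
    private final Deque<String> recent = new ArrayDeque<>();

    public synchronized void log(String level, String message) {
        if (recent.size() == CAPACITY) {
            recent.removeFirst();                     // drop the oldest line
        }
        recent.addLast(level + " | " + message);
        if ("ERROR".equals(level)) {
            dump();                                   // an error springs the trap
        }
    }

    private void dump() {
        System.out.println("--- trap fired: last " + recent.size() + " log lines ---");
        recent.forEach(System.out::println);
    }
}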

This is what we saw in APD:

Uncaught fatal error from thread [MxRuntimeSystem-action-dispatcher-16] shutting down ActorSystem [MxRuntimeSystem] maximum stack depth reached


The traps in APD pointed convincingly to a stack-overflow error. Named after the popular Q&A website (or was it the other way around?), this error is usually caused by endless recursion. Recursion in programming is when a function (or, in the case of Mendix, a microflow) calls itself; if it keeps calling itself indefinitely, it eventually blows the call stack. When this happens, the entire Akka system crashes and the application can no longer run microflows or any other kind of logic.
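For readers who do not run into this every day, here is the same failure mode reproduced in plain Java instead of a microflow (a hypothetical example, not code from ATS):

public class EndlessRecursion {

    // No base case: every call adds another frame until the JVM gives up.
    static void callMyself(int depth) {
        callMyself(depth + 1);
    }

    public static void main(String[] args) {
        try {
            callMyself(0);
        } catch (StackOverflowError e) {
            // A plain Java program can catch the error here; in the Mendix runtime the
            // overflow brought down the actor system that executes microflows.
            System.err.println("maximum stack depth reached");
        }
    }
}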


We already knew that the error was caused by a microflow calling itself, but we were not sure exactly which microflow was responsible. Again we turned to our tooling for help, this time to ACR. ACR has many rules to prevent issues in your application, and it just so happens to have a rule for recursion.

As the rule correctly points out, Mendix does not have a built-in fail-safe for such scenarios. I hope Mendix picks up on this and adds one to the platform; it sounds like something that should definitely be covered, producing a simple error instead of crashing the app completely.

Anyway, ACR provided us with a comprehensive list of microflows that were violating the recursion rule. By cross-referencing this with the trap logs from APD, we could identify the most likely culprit. A quick check of the code confirmed that a split (if) was indeed missing: if the microflow was called with an unlikely (but possible) combination of data, it would end up in endless recursion and the above-mentioned stack-overflow error.
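The actual microflow is not reproduced in this post, so the snippet below is only a rough translation of the bug pattern into Java, with an invented data model: the recursion walks up a parent chain, and the missing split (if) is the condition that should have stopped it.

public class MissingGuard {

    static class Item {
        Item parent;    // in the unlikely data combination, the root pointed to itself
    }

    // Buggy version: recurses unconditionally, so a self-referencing root never stops.
    static void processBuggy(Item item) {
        // ... do some work on item ...
        processBuggy(item.parent);
    }

    // Fixed version: the "missing split" is simply the condition that ends the recursion.
    static void processFixed(Item item) {
        // ... do some work on item ...
        if (item.parent != null && item.parent != item) {
            processFixed(item.parent);
        }
    }
}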

Patching the issue

After identifying the issue we turned to developing a fix. Adding the missing split (if) would probably have been enough, but we wanted something even safer, something that could never fail. The ACR rule documentation already outlines an easy way to achieve this:

In case you decide to have a recursion (notwithstanding the above), add an emergency brake that aborts the execution if the recursion executes itself too many times (for example, 200) to prevent the app from crashing.


The described fix is also known as a depth check. A depth check is easy to add, but if we made a mistake, for example forgetting to increment the depth somewhere or resetting it at the wrong place, we might very well end up back where we started. We could not afford that.
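Sketched in Java, the emergency brake from the ACR documentation is just an extra parameter that travels along with the recursion. The limit of 200 comes from the quote above; the rest of the names are invented for illustration. The comment marks exactly the spot that is easy to get wrong:

public class ManualDepthCheck {

    private static final int MAX_DEPTH = 200;   // limit suggested by the ACR documentation

    static class Item {
        Item parent;
    }

    static void process(Item item, int depth) {
        if (depth > MAX_DEPTH) {
            // Fail in a controlled way instead of letting the call stack overflow.
            throw new IllegalStateException("Recursion aborted after " + MAX_DEPTH + " levels");
        }
        // ... do some work on item ...
        if (item.parent != null && item.parent != item) {
            process(item.parent, depth + 1);    // forgetting this "+ 1" defeats the whole check
        }
    }
}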

Instead, Bart Tolen (a senior member of the team) wrote a simple depth check that works with the actual depth of the Mendix call stack. Since there is no longer a need to manually keep track of the depth, we needn't worry about forgetting to increment it. After testing the depth check in an isolated project and confirming that it worked as expected, we added it to our product app, confident that we had taken every precaution and that it would not cause further harm.

This is the depth check Java action; feel free to use it in your project:

CheckCallStack.mpk
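The contents of the .mpk are not reproduced here, but the idea behind it can be sketched in plain Java: instead of a counter that has to be passed around and incremented by hand, read the real depth of the current thread's call stack and abort well before it can overflow. The class name, limit and error message below are assumptions, not the actual implementation:

public class CheckCallStack {

    private static final int MAX_STACK_DEPTH = 500;   // assumed safety limit, well below overflow

    public static void check() {
        // Every nested microflow call adds frames to the JVM stack, so the length of the
        // stack trace is a depth measure that nobody has to remember to increment.
        int depth = Thread.currentThread().getStackTrace().length;
        if (depth > MAX_STACK_DEPTH) {
            throw new IllegalStateException(
                    "Call stack depth " + depth + " exceeds the limit of " + MAX_STACK_DEPTH);
        }
    }
}

Called at the start of the recursive microflow, a check like this turns a runtime-killing StackOverflowError into an ordinary error that can be handled and logged.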

Release

After the regression test suite passed, we could proceed with the release, confident that we had not broken something else while rushing to develop this fix. A funny side note: these regression tests are actually executed with ATS, so in a somewhat recursive fashion ATS tests itself.

At 17:15, only a few hours later that same day, the patch was live. More than a week has passed since then, and we have not seen the issue resurface. We had done it.

With some distance between us and the incident, I can clearly identify two key elements that helped us fix the issue so quickly that day:

  • Smart Digital Factory Tooling - Without ATS it would probably have been hours before we heard about the issue through a support ticket. Without APD and ACR it would have taken us much longer to pinpoint the root cause. The team built these tools to help other developers; in this case they helped us, and served as a great reminder of the importance of tooling and of the work that we do.
  • Teamwork - Even with all the tooling in the world, it would not have helped much if we did not have a team that can join forces and tackle complex problems. From the first alert until the resolution, everyone on the team pitched in wherever help was needed. Team members were scouring APD logs and forums, contacting users, brainstorming and testing potential fixes, all at the same time. This level of cooperation was the most important element that empowered us to fix the issue so fast.

I hope you enjoyed reading this post and that it helps you troubleshoot incidents.
