Site outage Oct 3-4, 2018
During the night of Oct 4, between 1:30 and 6:40 AM, the Selma websites www.selma.io and www.selmafinance.ch were unavailable. During the outage, the Selma app itself continued working normally and there were no security issues.
The issue occurred between Cloudflare, a security layer on top of our app, and the Selma app itself. The problem was caused by a security improvement that we had introduced: Content Security Policy (CSP) headers. Due to a bug in Ruby on Rails, the application framework we use, the total size of the headers the Selma app was sending gradually became too large for Cloudflare to handle. Upon noticing the situation in the morning, we immediately applied a fix.
This marks the first outage longer than a couple of minutes in the one-year history of Selma. Since we are committed to full transparency, we will always post a full description of what happened and what actions we are taking to make sure it will not happen again.
On October 3rd, our site went down first for about 30 minutes and later, during the night between October 3rd and 4th, for about 5 hours. The problem only affected the availability of the site to the outside world; the application itself kept working the whole time.
What went wrong?
The large CSP header alone wouldn’t have caused the site to go down. However, Cloudflare, the service we use as an extra security layer on top of our site, has a limit of 8 KB for response headers. Once the headers grew past that limit, Cloudflare started sending an error response to users, even though our app itself had sent a successful response.
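To make the size limit concrete, here is a minimal sketch of how one could estimate the size of a response's headers and compare it against the 8 KB limit. It assumes the limit means 8 * 1024 bytes counted over all "Name: value" lines; how Cloudflare measures the size exactly is not covered here, and the names header_bytes and within_header_limit? as well as the sample headers are made up for illustration.

```ruby
# Assumption: the 8 KB limit is 8 * 1024 bytes over all header lines.
HEADER_LIMIT_BYTES = 8 * 1024

# Approximate size of the headers as they would appear on the wire,
# one "Name: value\r\n" line per header.
def header_bytes(headers)
  headers.sum { |name, value| "#{name}: #{value}\r\n".bytesize }
end

# True when the combined headers fit within the limit.
def within_header_limit?(headers, limit = HEADER_LIMIT_BYTES)
  header_bytes(headers) <= limit
end

# Illustrative headers only: a short policy, and one bloated with
# 600 repeated entries (about 9,000 bytes).
small = {
  "Content-Type" => "text/html",
  "Content-Security-Policy" => "default-src 'self'"
}
large = small.merge(
  "Content-Security-Policy" => "default-src 'self' " + ("'nonce-abc123' " * 600)
)

puts within_header_limit?(small) # true
puts within_header_limit?(large) # false
```

A check like this only sees what the app sends, so it complements rather than replaces testing behind the same proxy that production uses.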
How was the issue fixed?
In the short term, the issue was fixed by turning CSP off on our site. In the longer term, the recently released Rails 5.2.1 fixes the bug, so upgrading our Rails version takes care of the issue.
How can we prevent similar issues from happening in the future?
The creeping nature of the issue made it hard to detect: everything worked initially, and from the app’s perspective everything was still fine even though Cloudflare kept rejecting its responses. The issue also didn’t materialise on our staging environment, because staging wasn’t behind Cloudflare and never received enough traffic for the headers to grow that large.
To prevent this from happening again, we are writing tests that check that successive requests don’t increase the size of the CSP header, which should catch regressions of this particular issue. To catch similar issues of a different kind, we will change our staging setup to match the Cloudflare setup in production.
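The kind of check described above could look roughly like the following sketch. It is not our actual test suite: it uses plain Ruby lambdas that mimic the Rack response shape [status, headers, body] instead of a real Rails app, and leaky_app, stable_app, csp_sizes and csp_size_stable? are hypothetical names. The leaky app only imitates a header that grows with every request; it does not reproduce the Rails bug itself.

```ruby
CSP = "Content-Security-Policy"

# Hypothetical app with the faulty behaviour: one policy string is shared
# across requests and a new nonce is appended to it every time.
def leaky_app
  policy = +"script-src 'self'"
  lambda do |_env|
    policy << " 'nonce-#{rand(36**8).to_s(36)}'"
    [200, { CSP => policy.dup }, ["ok"]]
  end
end

# Hypothetical app with the intended behaviour: a fresh policy with a
# fixed-width nonce is built for every request.
def stable_app
  lambda do |_env|
    nonce = format("%08x", rand(16**8))
    [200, { CSP => "script-src 'self' 'nonce-#{nonce}'" }, ["ok"]]
  end
end

# Sends `count` requests and returns the CSP header size seen on each one.
def csp_sizes(app, count: 20)
  Array.new(count) { app.call({})[1].fetch(CSP).bytesize }
end

# True when no request produced a larger CSP header than the first one.
def csp_size_stable?(app, count: 20)
  sizes = csp_sizes(app, count: count)
  sizes.all? { |size| size <= sizes.first }
end

puts csp_size_stable?(stable_app) # true
puts csp_size_stable?(leaky_app)  # false
```

In a real test suite the same idea would be expressed as an integration test that issues several requests against the application and asserts that the header size stays constant.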