Site outage Oct 3-4, 2018

In the early hours of Oct 4, between 1:30 and 6:40 AM, the Selma websites www.selma.io and www.selmafinance.ch were unavailable. During the outage, the Selma app itself continued working normally and there were no security issues.

The issue occurred between Cloudflare, a security layer on top of our app, and the Selma app itself. The problem was caused by a security improvement we had introduced: Content Security Policy (CSP) headers. Due to a bug in Ruby on Rails, the application framework we use, the total size of the headers the Selma app was sending gradually became too large for Cloudflare to handle. Upon noticing the situation in the morning, we immediately applied a fix.
This marks the first outage longer than a couple of minutes in the one-year history of Selma. Since we are committed to full transparency, we will always post a full description of what happened and what actions we are taking to make sure it will not happen again.

Post-mortem

On October 3rd, our site went down for about 30 minutes. It went down again during the night between October 3rd and 4th, this time for about 5 hours. The problem only affected the availability of the site to the outside world; the application itself kept working the whole time.

What went wrong?

On the morning of October 3rd, we deployed a security measure called CSP (Content Security Policy), which adds extra protection against security exploits such as Cross-Site Scripting (XSS). Because using a CSP can cause hidden issues with third-party JavaScript (such as FB Connect), we deployed it in report-only mode. This means browsers won’t block code that breaks the policy, but will report it to us instead. The deploy went smoothly and the site worked correctly.

One mechanism in CSP is to define a unique nonce that then needs to be included in JavaScript elements in order for them to be allowed to run. This nonce is supposed to be a new one for every request. However, due to a bug in Ruby on Rails 5.2.0 (Issue #32597, "CSP nonces being added to header after every request indefinitely", reported by Envek in rails/rails on GitHub), the server kept appending new nonces to the CSP header instead of replacing the previous nonce. This caused the response headers to grow linearly with every request. Every time the site was redeployed, the number of nonces started from 1 again.
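
To illustrate the failure mode, here is a minimal Ruby sketch. It is not the Rails source code: LeakyPolicy and FixedPolicy are hypothetical stand-ins, and the "script-src 'self'" base policy and the 16-byte nonce are assumptions made for the example.

```ruby
require "securerandom"

# Hypothetical stand-in for a policy object that lives as long as the
# server process. This is NOT the actual Rails implementation.
class LeakyPolicy
  def initialize
    @script_src = ["'self'"] # assumed base policy
  end

  # Buggy behaviour: every request appends a nonce to the shared list,
  # so the header keeps all nonces from all earlier requests.
  def header_for_request
    @script_src << "'nonce-#{SecureRandom.base64(16)}'"
    "script-src #{@script_src.join(' ')}"
  end
end

# Intended behaviour: the shared list is never mutated, and each
# response carries exactly one fresh nonce.
class FixedPolicy
  def initialize
    @script_src = ["'self'"].freeze
  end

  def header_for_request
    sources = @script_src + ["'nonce-#{SecureRandom.base64(16)}'"]
    "script-src #{sources.join(' ')}"
  end
end

leaky = LeakyPolicy.new
fixed = FixedPolicy.new

# Header size in bytes over five simulated requests.
puts "leaky: #{Array.new(5) { leaky.header_for_request.bytesize }.inspect}"
puts "fixed: #{Array.new(5) { fixed.header_for_request.bytesize }.inspect}"
```

In the leaky variant the size grows by a constant amount per request, while in the fixed variant it stays the same. Creating a new LeakyPolicy resets the count, which mirrors what happened on every redeploy.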

The large CSP header alone wouldn’t have taken the site down. However, Cloudflare, the service we use as an extra security layer on top of our site, has a limit of 8 KB for response headers. Once the headers grew past that limit, Cloudflare started sending an error response to users, even though our app itself was still sending successful responses.
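
As a rough back-of-the-envelope estimate of how quickly such a limit is reached, the following Ruby sketch computes the number of requests after which linearly growing headers exceed a size limit. All inputs are illustrative assumptions rather than measurements from our system: the 8 KB limit is taken as 8192 bytes, the other headers are assumed to add up to 500 bytes, and each nonce is assumed to add 33 bytes (a 24-character base64 nonce wrapped in 'nonce-...' plus a separating space).

```ruby
# Assumed interpretation of the 8 KB limit as 8 * 1024 bytes.
LIMIT_BYTES = 8 * 1024

# Returns the smallest number of requests n for which
#   base_bytes + n * bytes_per_nonce > limit_bytes
# All parameter names are hypothetical and only used for this estimate.
def requests_until_over_limit(base_bytes:, bytes_per_nonce:, limit_bytes: LIMIT_BYTES)
  n = (limit_bytes - base_bytes) / bytes_per_nonce + 1
  [n, 0].max # headers already over the limit means zero requests needed
end

puts requests_until_over_limit(base_bytes: 500, bytes_per_nonce: 33)
# prints 234
```

Under these assumptions the limit is crossed after only a couple of hundred requests, which fits the pattern of the site working right after a deploy and failing some time later.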

How was the issue fixed?

In the short term, the issue was fixed by turning the CSP off on our site. In the longer term, the recently released Rails 5.2.1 contains a fix for the bug, so upgrading our Rails version takes care of the issue.

How can we prevent similar issues from happening in the future?

The creeping nature of the issue made it hard to detect, because everything worked initially and, from the app’s perspective, everything was still fine even though Cloudflare kept rejecting its responses. The issue also didn’t materialise on our staging environment, because staging wasn’t behind Cloudflare and never received enough traffic to trigger the issue in the first place.

To guard against this, we are writing tests that check that successive requests do not increase the size of the CSP header, which should prevent regressions of this particular bug. To catch similar issues, we will change our staging setup to match the Cloudflare setup in production.
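
Such a regression check could look roughly like the following Ruby sketch. It is not our actual test suite: csp_headers_over_requests and csp_header_stable? are hypothetical helpers, and fetch_headers stands for any callable that performs one request and returns the response headers as a Hash. In a real application it would wrap an actual request; here a fake app is used so the example runs on its own.

```ruby
require "securerandom"

# Header name used in report-only mode.
CSP_HEADER = "Content-Security-Policy-Report-Only"

# Hypothetical helper: performs `requests` calls and collects the CSP
# header value of each response.
def csp_headers_over_requests(fetch_headers, requests: 20)
  Array.new(requests) { fetch_headers.call.fetch(CSP_HEADER) }
end

# A header series is considered stable when every response carries
# exactly one nonce and no response is larger than the first one.
# This assumes nonces of constant length.
def csp_header_stable?(headers)
  one_nonce_each = headers.all? { |h| h.scan("'nonce-").length == 1 }
  no_growth = headers.map(&:bytesize).max <= headers.first.bytesize
  one_nonce_each && no_growth
end

# Fake app standing in for a healthy server (assumed policy).
healthy_app = lambda do
  { CSP_HEADER => "script-src 'self' 'nonce-#{SecureRandom.base64(16)}'" }
end

puts csp_header_stable?(csp_headers_over_requests(healthy_app))
# prints true
```

Checking the number of nonces as well as the byte size means the check fails on the second request already, long before any size limit is reached.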

💙Selma development crew