
Postmortem of Service Outage at 3.4M CCU

2018.2.8
The Epic Team
Fortnite hit a new peak of 3.4 million concurrent players last Sunday… and that didn’t come without issues!  This blog post aims to share technical details about the challenges of rapidly scaling a game and its online services far beyond our wildest growth expectations.

Also, Epic Games needs YOU!  If you have domain expertise to solve problems like these, and you’d like to contribute to Fortnite and other efforts, join Epic in Seattle, North Carolina, Salt Lake City, San Francisco, UK, Stockholm, Seoul, or elsewhere!  Please shoot us an email at [email protected].

Postmortem of February 3rd and 4th outages


The extreme load caused 6 different incidents between Saturday and Sunday, with a mix of partial and total service disruptions to Fortnite.

MCP database latency

Fortnite has a service called MCP (remember the Tron nemesis?) that players contact to retrieve game profiles, statistics, items, matchmaking info and more. It’s backed by several sets of databases used to persistently store this data. The database behind the Fortnite game service is our largest to date.

The primary MCP database comprises 9 MongoDB shards, where each shard has a writer, two read replicas, and a hidden replica for redundancy. At a high level, user-specific data is spread across 8 of the shards, while the remaining shard holds matchmaking sessions, shared service caches, and runtime configuration data.

The MCP is architected such that each service has a db connection pool to a sidecar process, which in turn maintains a connection pool to all of our shards. At peak the MCP handles 124k client requests per second, which translates to 318k database reads and 132k database writes per second with a sub-10ms average database response time. Of that, matchmaking accounts for roughly 15% of all db queries and 11% of all writes, all hitting a single shard. In addition, our current matchmaking implementation requires the data to live in a single collection.
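
For illustration, here is a minimal sketch of how a sidecar-style router might hold one pooled MongoDB client per shard, roughly mirroring the topology described above. It assumes the MongoDB Java driver; the class name, pool sizes, and shard URIs are hypothetical, not Epic’s actual implementation.

    import com.mongodb.ConnectionString;
    import com.mongodb.MongoClientSettings;
    import com.mongodb.ReadPreference;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.TimeUnit;

    // Hypothetical sidecar router: one pooled client per shard, shared by the local service.
    public final class ShardRouter {
        private final Map<String, MongoClient> clientsByShard = new HashMap<>();

        public ShardRouter(List<String> shardUris) {
            for (String uri : shardUris) {
                MongoClientSettings settings = MongoClientSettings.builder()
                        .applyConnectionString(new ConnectionString(uri))
                        // Reads may go to secondaries; writes always go to the shard primary.
                        .readPreference(ReadPreference.secondaryPreferred())
                        // Cap the pool and how long a caller waits for a free connection.
                        .applyToConnectionPoolSettings(b -> b
                                .maxSize(64)
                                .maxWaitTime(2, TimeUnit.SECONDS))
                        .build();
                clientsByShard.put(uri, MongoClients.create(settings));
            }
        }

        // User-specific data is spread across 8 shards; the remaining shard holds
        // matchmaking sessions, shared caches, and runtime configuration.
        public MongoClient forShard(String shardUri) {
            return clientsByShard.get(shardUri);
        }
    }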

At peak we see an issue where the matchmaking shard begins queuing writes while waiting on available writer resources. This can cause db update times to spike into the 40,000+ ms range per operation, causing MCP threads to block. Players then experience unusually long wait times not just when attempting to matchmake, but across all operations. We have investigated this in detail; it is currently unclear to us and to support why our writes are being queued this way, but we are working towards a root cause.
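
One generic way to keep stalled writes from tying up service threads indefinitely is to put hard client-side timeouts on the driver. The sketch below (MongoDB Java driver, with hypothetical database, collection, and timeout values) shows the idea; it is not Epic’s fix for the queuing itself, only a way to bound how long a blocked operation can hold a thread while the root cause is investigated.

    import com.mongodb.MongoClientSettings;
    import com.mongodb.WriteConcern;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    import java.util.concurrent.TimeUnit;

    public final class BoundedMatchmakingWrites {
        public static void main(String[] args) {
            MongoClientSettings settings = MongoClientSettings.builder()
                    // If the server stops responding, fail the call instead of blocking the
                    // thread for tens of seconds (the 40,000+ ms spikes described above).
                    // Connects to localhost by default; a real deployment passes a connection string.
                    .applyToSocketSettings(b -> b.readTimeout(5, TimeUnit.SECONDS))
                    .build();

            try (MongoClient client = MongoClients.create(settings)) {
                MongoCollection<Document> sessions = client
                        .getDatabase("matchmaking")          // hypothetical names
                        .getCollection("sessions")
                        // Also bound how long the server waits for write acknowledgement.
                        .withWriteConcern(WriteConcern.MAJORITY.withWTimeout(2, TimeUnit.SECONDS));

                sessions.updateOne(new Document("_id", "session-123"),
                        new Document("$set", new Document("state", "matched")));
            }
        }
    }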

The issue does not recover on its own, and the db process soon becomes unresponsive, at which point we need to perform a manual primary failover to restore functionality. During these outages this procedure was repeated multiple times per hour, with each failover causing a brief window of matchmaking instability followed by recovery.
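
For readers unfamiliar with the procedure, a manual primary failover in MongoDB typically amounts to asking the current primary to step down so that a secondary is elected. A minimal sketch with the Java driver follows; the host name is hypothetical, and this is the standard replSetStepDown command rather than anything Epic-specific.

    import com.mongodb.MongoException;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import org.bson.Document;

    public final class ManualFailover {
        public static void main(String[] args) {
            // Connect directly to the current primary of the affected shard (hypothetical host).
            try (MongoClient client = MongoClients.create(
                    "mongodb://mm-shard-primary:27017/?directConnection=true")) {
                // Ask the primary to step down for 60 seconds so a secondary is elected.
                client.getDatabase("admin")
                      .runCommand(new Document("replSetStepDown", 60));
            } catch (MongoException e) {
                // A stepping-down primary closes client connections, so a network-level
                // error here is expected rather than a sign of failure.
                System.out.println("Step-down issued; connection dropped as expected: " + e.getMessage());
            }
        }
    }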

MCP thread configuration

Prior to the launch of Fortnite, we had made a change to the packaging of the MCP. As part of that change we introduced a bug that limited the number of available service threads to below what we considered a safe default for our scale at the time. During a recent performance pass, this mistake was corrected by reverting the setting to its previously intended value.

However, once the change was deployed to our live environment, we noticed requests experiencing increased latency (from a double-digit millisecond average to double-digit seconds) that was not present in our pre-production environments. This was diagnosed as db connection pool starvation via real-time CPU sampling through a diagnostics endpoint. To quickly remediate the issue we rolled back to our previous thread pool configuration.
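
The post does not describe the diagnostics endpoint itself, but a simple version of the idea is a JAX-RS resource that dumps live thread state and CPU time, which makes “many threads parked waiting on the db connection pool” easy to spot. A hedged sketch (path and class name are hypothetical):

    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.Produces;
    import javax.ws.rs.core.MediaType;
    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Hypothetical diagnostics resource: a snapshot of what every service thread is doing.
    @Path("/diagnostics/threads")
    public class ThreadDiagnosticsResource {

        @GET
        @Produces(MediaType.TEXT_PLAIN)
        public String dumpThreads() {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            StringBuilder out = new StringBuilder();
            for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
                // getThreadCpuTime returns nanoseconds, or -1 if measurement is unsupported.
                long cpuMillis = threads.getThreadCpuTime(info.getThreadId()) / 1_000_000;
                out.append(String.format("%-40s %-13s cpu=%dms%n",
                        info.getThreadName(), info.getThreadState(), cpuMillis));
            }
            return out.toString();
        }
    }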

What we expected to be a performance improvement resulted in the opposite and was only revealed at peak production workloads. 
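
The underlying failure mode is generic: if a service runs more worker threads than it has pooled db connections, the extra threads simply queue up waiting for a connection under load. A rough sketch of keeping the two in balance, with all numbers hypothetical:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public final class WorkerPoolSizing {
        public static void main(String[] args) {
            int dbConnectionsPerInstance = 64;   // matches the driver pool's maxSize (hypothetical)
            // Keep worker threads close to the number of db connections they can actually use;
            // raising threads without raising connections just moves the queue into the service.
            int workerThreads = dbConnectionsPerInstance;

            ThreadPoolExecutor workers = new ThreadPoolExecutor(
                    workerThreads, workerThreads,
                    0L, TimeUnit.MILLISECONDS,
                    // A bounded queue so overload shows up as rejections rather than silent latency.
                    new ArrayBlockingQueue<>(1_000),
                    new ThreadPoolExecutor.AbortPolicy());

            workers.submit(() -> { /* handle one MCP request, using one pooled db connection */ });
            workers.shutdown();
        }
    }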

The MCP issues described above can be seen in the Saturday graph below, with spikes partially representing matchmaking db failures and overall poor performance due to db thread pool starvation and a gradual rolling deploy.

[Graph: MCP error spikes on Saturday]

The impact of just matchmaking db failures on Sunday can be seen below:

[Graph: matchmaking db failures on Sunday]

Account Service outage

Account Service is the core Epic service which maintains user account data and serves as an authentication endpoint. Service in numbers:
  • Throughput: 40k - 100k req/sec (2.5M - 6M rpm). Up to 160k/sec at land rush.
  • Latency:
    • All APIs: avg < 10ms, p99 < 100ms, p99.9 < 400ms
    • Sign In: avg < 100ms
    • Auth check: avg < 5ms
  • Sign up: 1k - 3k per minute.
  • Sign in/out: 0.5k - 1k fresh sign-ins/sec, plus 0.7k - 1.5k refreshes/sec (session extension).

Account Service is a complex application with a JAXRS-based web-service component and a number of sidecar processes, one of which is an Nginx proxy sitting in front of it.

[Diagram: Nginx proxy with Memcached cache in front of the Account Service application]
The main purpose of this proxy is to short-circuit the access token verification path. All traffic is routed through the proxy, but only access token verification traffic is checked against the cache, as shown above; all other calls simply pass through to the main application.
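
Conceptually, the proxy’s job is: look the access token up in Memcached, answer verification requests from the cache when possible, and let everything else (or a cache miss) fall through to the main application. A minimal Java sketch of that shortcut, using the spymemcached client; the key scheme, TTL, and verifyAgainstBackend call are hypothetical stand-ins rather than Epic’s real implementation.

    import net.spy.memcached.MemcachedClient;

    import java.io.IOException;
    import java.net.InetSocketAddress;

    public final class TokenVerificationCache {
        private final MemcachedClient cache;

        public TokenVerificationCache(String memcachedHost) throws IOException {
            this.cache = new MemcachedClient(new InetSocketAddress(memcachedHost, 11211));
        }

        public boolean verify(String accessToken) {
            // Fast path: a cache hit means the token was recently verified as valid.
            if (cache.get("token:" + accessToken) != null) {      // hypothetical key scheme
                return true;
            }
            // Slow path: fall through to the main application, then cache valid tokens briefly.
            boolean valid = verifyAgainstBackend(accessToken);
            if (valid) {
                cache.set("token:" + accessToken, 300, "1");      // 300-second TTL, hypothetical
            }
            return valid;
        }

        // Stand-in for the call to the JAX-RS application / session store behind it.
        private boolean verifyAgainstBackend(String accessToken) {
            return accessToken != null && !accessToken.isEmpty();
        }
    }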

On Sunday, there was an incident in which Memcached instability saturated Nginx capacity (essentially occupying all available worker threads), so other traffic simply couldn’t get through to the main application.

[Graph: Account Service during the Sunday outage]

Below is a quick post-mortem summary:
  • Timeline:
    • 2018-02-04 18:30 UTC - 2018-02-04 20:20 UTC
  • Root cause:
The Nginx proxy in front of the Java application was saturated, so traffic was throttled before it ever reached the application. As a result, all of our JVM-level protections against situations like this couldn’t help.
  • Incident Details:
    • Our Memcached component started failing under the load due to network and connection saturation.
    • Nginx was next in line: it got stuck waiting on Memcached calls to time out (100ms each), quickly ran out of free worker threads, and became unable to serve other traffic, including health-check calls.
    • All verify calls fell through to the JVM layer with the extra 100ms Memcached timeout added, which also increased the load on Redis.
    • Missed health-check requests caused the load balancer to pull all nodes out of rotation, effectively resulting in full service downtime.
  • Impact:
    • The good news is that, thanks to resiliency measures we have in place, this had only a moderate impact on players already in a match.
    • Signing in (and out) was mostly blocked, and the Epic Games Launcher would sign players out with an error.
  • Next steps:
    • Leveraging Layer 7 routing in the ALB to direct all non-verify traffic straight to our Java application, which has a number of protection measures implemented against saturation like this (see the sketch after this list).
    • Significantly increasing Memcached capacity.
    • Followed by removing the Nginx + Memcached pair altogether.
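
As a rough illustration of the kind of JVM-level protection mentioned above, a service can cap how many requests are allowed to wait on a slow dependency and fail fast once that cap is reached, so worker threads (and health checks) stay available. This is a hedged sketch using a plain semaphore; the limits, names, and error mapping are hypothetical, not the actual measures in the Account Service.

    import java.util.concurrent.Semaphore;
    import java.util.concurrent.TimeUnit;

    public final class VerifySaturationGuard {
        // At most this many requests may be blocked on the cache / session store at once.
        private static final int MAX_CONCURRENT_VERIFIES = 200;   // hypothetical limit
        private final Semaphore permits = new Semaphore(MAX_CONCURRENT_VERIFIES);

        public boolean verifyOrFailFast(String accessToken) {
            boolean acquired = false;
            try {
                // Wait briefly for a slot; if the dependency is saturated, shed load instead
                // of letting every worker thread pile up behind it.
                acquired = permits.tryAcquire(50, TimeUnit.MILLISECONDS);
                if (!acquired) {
                    throw new IllegalStateException("verification temporarily unavailable"); // map to HTTP 503
                }
                return verifyToken(accessToken);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting to verify");
            } finally {
                if (acquired) {
                    permits.release();
                }
            }
        }

        // Stand-in for the real token check (Memcached and/or session store backed).
        private boolean verifyToken(String accessToken) {
            return accessToken != null && !accessToken.isEmpty();
        }
    }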