{"id":487790,"date":"2025-11-13T04:27:11","date_gmt":"2025-11-13T04:27:11","guid":{"rendered":"https:\/\/blog.roblox.com\/?p=39916"},"modified":"2025-11-13T04:27:11","modified_gmt":"2025-11-13T04:27:11","slug":"roblox-return-to-service-10-28-10-31-2021","status":"publish","type":"post","link":"https:\/\/arcader.org\/news\/roblox-return-to-service-10-28-10-31-2021\/","title":{"rendered":"Roblox Return to Service 10\/28-10\/31 2021"},"content":{"rendered":"<p><span style=\"font-weight: 400;\">Starting October 28th and fully resolving on October 31st, Roblox experienced a 73-hour outage.<span style=\"color: #000000;\">\u00b9<\/span><\/span><span style=\"font-weight: 400;\"> Fifty million players regularly use Roblox every day and, to create the experience our players expect, our scale involves hundreds of internal online services. As with any large-scale service, we have service interruptions from time to time, but the extended length of this outage makes it particularly noteworthy. We sincerely apologize to our community for the downtime.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We\u2019re sharing these technical details to give our community an understanding of the root cause of the problem, how we addressed it, and what we are doing to prevent similar issues from happening in the future. We would like to reiterate there was no user data loss or access by unauthorized parties of any information during the incident.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Roblox Engineering and technical staff from HashiCorp combined efforts to return Roblox to service. We want to acknowledge the HashiCorp team, who brought on board incredible resources and worked with us tirelessly until the issues were resolved.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Outage Summary<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">The outage was unique in both duration and complexity. 
The team had to address a number of challenges in sequence to understand the root cause and bring the service back up.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The outage lasted 73 hours.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The root cause had two components. Enabling a relatively new streaming feature on Consul under unusually high read and write load led to excessive contention and poor performance. In addition, our particular load conditions triggered a pathological performance issue in BoltDB. The open-source BoltDB system is used within Consul to manage write-ahead logs for leader election and data replication.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">A single Consul cluster supporting multiple workloads exacerbated the impact of these issues.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Challenges in diagnosing these two mostly unrelated issues buried deep in the Consul implementation were largely responsible for the extended downtime.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Critical monitoring systems that would have provided better visibility into the cause of the outage relied on affected systems, such as Consul. 
This combination severely hampered the triage process.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">We were thoughtful and careful in our approach to bringing Roblox up from an extended fully-down state, which also took notable time.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">We have accelerated engineering efforts to improve our monitoring, remove circular dependencies in our observability stack, and accelerate our bootstrapping process.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">We are working to move to multiple availability zones and data centers.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">We are remediating the issues in Consul that were the root cause of this event.<\/span><\/li>\n<\/ul>\n<h2><span style=\"font-weight: 400;\">Preamble: Our Cluster Environment and HashiStack<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Roblox\u2019s core infrastructure runs in Roblox data centers. We deploy and manage our own hardware, as well as our own compute, storage, and networking systems on top of that hardware. 
The scale of our deployment is significant, with over 18,000 servers and 170,000 containers.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In order to run thousands of servers across multiple sites, we leverage a technology suite commonly known as the \u201c<\/span><a href=\"https:\/\/www.hashicorp.com\/resources\/how-we-used-the-hashistack-to-transform-the-world-of-roblox\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">HashiStack<\/span><\/a><span style=\"font-weight: 400;\">.\u201d <\/span><b>Nomad<\/b><span style=\"font-weight: 400;\">, <\/span><b>Consul <\/b><span style=\"font-weight: 400;\">and <\/span><b>Vault <\/b><span style=\"font-weight: 400;\">are the technologies that we use to manage servers and services around the world, and that allow us to orchestrate containers that support Roblox services.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Nomad<\/span> <span style=\"font-weight: 400;\">is used for scheduling work. It decides which containers are going to run on which nodes and on which ports they\u2019re accessible. It also validates container health. All of this data is relayed to a Service Registry, which is a database of IP:Port combinations. Roblox services use the Service Registry to find one another so they can communicate. This process is called \u201cservice discovery.\u201d We use <\/span><b>Consul <\/b><span style=\"font-weight: 400;\">for service discovery, health checks, <\/span><span style=\"font-weight: 400;\">session locking (for HA systems built on-top), and as a KV store.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Consul is deployed as a cluster of machines in two roles. \u201cVoters\u201d (5 machines) authoritatively hold the state of the cluster; \u201cNon-voters\u201d (5 additional machines) are read-only replicas that assist with scaling read requests. At any given time, one of the voters is elected by the cluster as leader. 
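With five voters, a write is durable only once a strict majority (three voters) has acknowledged it. The following is a minimal Go sketch of that majority rule; it is our illustration only, not Consul or Raft source code:

```go
package main

import "fmt"

// quorum returns the number of voters that must acknowledge a write
// before it is considered committed: a strict majority.
func quorum(voters int) int { return voters/2 + 1 }

// committed reports whether an entry acknowledged by acks voters
// (the leader counts itself) is durable in a cluster of the given size.
func committed(acks, voters int) bool { return acks >= quorum(voters) }

func main() {
	// Five voters, as in the Roblox cluster; non-voters never count.
	fmt.Println(quorum(5))       // 3
	fmt.Println(committed(2, 5)) // false
	fmt.Println(committed(3, 5)) // true
}
```

Under this rule a five-voter cluster keeps accepting writes even with two voters down, which is why hardware failure alone rarely takes Consul out.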
The leader is responsible for replicating data to the other voters and determining whether written data has been fully committed. Consul uses an algorithm called <\/span><a href=\"https:\/\/raft.github.io\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Raft<\/span><\/a><span style=\"font-weight: 400;\"> for leader election and <\/span><span style=\"font-weight: 400;\">to distribute state across the cluster in a way that ensures each node in the cluster agrees upon the updates. It is not uncommon for the leader to change via leader election several times throughout a given day.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The following is a recent screenshot of a Consul dashboard at Roblox after the incident. Many of the key operational metrics referenced in this blog post are shown at normal levels. KV Apply time, for instance, is considered normal at less than 300ms and is 30.6ms at this moment. The Consul leader has had contact with other servers in the cluster in the last 32ms, which is very recent.<\/span><\/p>\n<div id=\"attachment_39927\" style=\"width: 2110px\" class=\"wp-caption aligncenter\"><img decoding=\"async\" aria-describedby=\"caption-attachment-39927\" loading=\"lazy\" class=\"wp-image-39927 size-full\" src=\"https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021.jpg\" alt=\"\" width=\"2100\" height=\"1410\" srcset=\"https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021.jpg 2100w, https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-1.jpg 300w, https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-2.jpg 1024w, https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-3.jpg 768w, https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-4.jpg 1536w, 
https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-5.jpg 2048w, https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-6.jpg 1920w\" sizes=\"auto, (max-width: 2100px) 100vw, 2100px\" \/><\/p>\n<p id=\"caption-attachment-39927\" class=\"wp-caption-text\">1. Normal operations of Consul at Roblox<\/p>\n<\/div>\n<p><span style=\"font-weight: 400;\">In the months leading up to the October incident, Roblox upgraded from Consul 1.9 to <\/span><a href=\"https:\/\/learn.hashicorp.com\/tutorials\/consul\/1-10?in=consul\/new-release\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Consul 1.10<\/span><\/a><span style=\"font-weight: 400;\"> to take advantage of <\/span><a href=\"https:\/\/medium.com\/criteo-engineering\/consul-streaming-whats-behind-it-6f44f77a5175\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">a new streaming feature<\/span><\/a><span style=\"font-weight: 400;\">. This streaming feature is designed to significantly reduce the CPU and network bandwidth needed to distribute updates across large-scale clusters like the one at Roblox.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Initial Detection (10\/28 13:37)<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">On the afternoon of October 28th, Vault performance was degraded and a single Consul server had high CPU load. Roblox engineers began to investigate. 
At this point players were not impacted.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Early Triage (10\/28 13:37 \u2013 10\/29 02:00)<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">The initial investigation suggested that the Consul cluster that Vault and many other services depend on was unhealthy. Specifically, the Consul cluster metrics showed elevated write latency for the underlying KV store in which Consul stores data. The 50th percentile latency on these operations was typically under 300ms but was now 2 seconds. Hardware issues are not unusual at Roblox\u2019s scale, and Consul can survive hardware failure. However, if hardware is merely slow rather than failing, it can impact overall Consul performance. In this case, the team suspected degraded hardware performance as the root cause and began the process of replacing one of the Consul cluster nodes. This was our first attempt at diagnosing the incident. <\/span><span style=\"font-weight: 400;\">Around this time, staff from HashiCorp joined Roblox engineers to help with diagnosis and remediation. All references to \u201cthe team\u201d and \u201cthe engineering team\u201d from this point forward refer to both Roblox and HashiCorp staff.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Even with new hardware, Consul cluster performance continued to suffer. 
At 16:35, the number of online players dropped to 50% of normal.<\/span><\/p>\n<div id=\"attachment_39937\" style=\"width: 1133px\" class=\"wp-caption aligncenter\"><img decoding=\"async\" aria-describedby=\"caption-attachment-39937\" loading=\"lazy\" class=\"wp-image-39937 size-full\" src=\"https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021.png\" alt=\"\" width=\"1123\" height=\"740\" srcset=\"https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021.png 1123w, https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-11.png 300w, https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-12.png 1024w, https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-13.png 768w\" sizes=\"auto, (max-width: 1123px) 100vw, 1123px\" \/><\/p>\n<p id=\"caption-attachment-39937\" class=\"wp-caption-text\">2. CCU during the 16:35 PST Player Drop<\/p>\n<\/div>\n<p><span style=\"font-weight: 400;\">This drop coincided with a significant degradation in system health, which ultimately resulted in a complete system outage. Why? When a Roblox service wants to talk to another service, it relies on Consul to have up-to-date knowledge of the location of the service it wants to talk to. However, if Consul is unhealthy, servers struggle to connect. Furthermore, Nomad and Vault rely on Consul, so when Consul is unhealthy, the system cannot schedule new containers or retrieve production secrets used for authentication. In short, the system failed because Consul was a single point of failure, and Consul was not healthy.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">At this point, the team developed a new theory about what was going wrong: increased traffic. 
Perhaps Consul was slow because our system reached a tipping point, and the servers on which Consul was running could no longer handle the load? This was our second attempt at diagnosing the root cause of the incident.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Given the severity of the incident, the team decided to replace all the nodes in the Consul cluster with new, more powerful machines. These new machines had 128 cores (a 2x increase) and newer, faster NVMe SSD disks. By 19:00, the team had migrated most of the cluster to the new machines, but the cluster was still not healthy. The cluster was reporting that a majority of nodes were not able to keep up with writes, and the 50th percentile latency on KV writes was still around 2 seconds rather than the typical 300ms or less. <\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Return to Service Attempt #1 (10\/29 02:00 \u2013 04:00)<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">The first two attempts to return the Consul cluster to a healthy state were unsuccessful. We could still see elevated KV write latency, as well as a new symptom that we could not explain: the Consul leader was regularly out of sync with the other voters.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The team decided to shut down the entire Consul cluster and reset its state using a snapshot from a few hours earlier, taken at the beginning of the outage. We understood that this would potentially incur a small amount of system config data loss (<\/span><b>not<\/b><span style=\"font-weight: 400;\"> user data loss). Given the severity of the outage and our confidence that we could restore this system config data by hand if needed, we felt this was acceptable.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We expected that restoring from a snapshot taken when the system was healthy would bring the cluster into a healthy state, but we had one additional concern. 
Even though Roblox did not have any user-generated traffic flowing through the system at this point, internal Roblox services were still live and dutifully reaching out to Consul <\/span><span style=\"font-weight: 400;\">to learn the location of their dependencies <\/span><span style=\"font-weight: 400;\">and to update their health information. These reads and writes were generating a significant load on the cluster. We were worried that this load might immediately push the cluster back into an unhealthy state even if the cluster reset was successful. To address this concern, we configured <\/span><span style=\"font-weight: 400;\">iptables<\/span><span style=\"font-weight: 400;\"> on the cluster to block access. This would allow us to bring the cluster back up in a controlled way and help us understand if the load we were putting on Consul independent of user traffic was part of the problem.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The reset went smoothly, and initially, the metrics looked good. When we removed the <\/span><span style=\"font-weight: 400;\">iptables<\/span><span style=\"font-weight: 400;\"> block, the service discovery and health check load from the internal services returned as expected. However, Consul performance began to degrade again, and eventually we were back to where we started: 50th percentile on KV write operations was back at 2 seconds. Services that depended on Consul were starting to mark themselves \u201cunhealthy,\u201d and eventually, the system fell back into the now-familiar problematic state. It was now 04:00. There was clearly something about our load on Consul that was causing problems, and over 14 hours into the incident, we still didn\u2019t know what it was.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Return to Service Attempt #2\u00a0 (10\/29 04:00 \u2013 10\/30 02:00)<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">We had ruled out hardware failure. 
Faster hardware hadn\u2019t helped and, as we learned later, had potentially hurt stability. Resetting Consul\u2019s internal state hadn\u2019t helped either. There was no user traffic coming in, yet Consul was still slow. We had leveraged<\/span><span style=\"font-weight: 400;\"> iptables<\/span><span style=\"font-weight: 400;\"> to let traffic back into the cluster slowly. Was the cluster simply getting pushed back into an unhealthy state by the sheer volume of thousands of containers trying to reconnect? This was our third attempt at diagnosing the root cause of the incident.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The engineering team decided to reduce Consul usage and then carefully and systematically reintroduce it. To ensure we had a clean starting point, we also blocked remaining external traffic. We assembled an exhaustive list of services that use Consul and rolled out config changes to disable all non-essential usage. This process took several hours due to the wide variety of systems and config change types targeted. Roblox services that typically had hundreds of instances running were scaled down to single digits. Health check frequency was decreased from 60 seconds to 10 minutes to give the cluster additional breathing room. At 16:00 on Oct 29th, over 24 hours after the start of the outage, the team began its second attempt to bring Roblox back online. Once again, the initial phase of this restart attempt looked good, but by 02:00 on Oct 30th, Consul was again in an unhealthy state, this time with significantly less load from the Roblox services that depend on it.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">At this point, it was clear that overall Consul usage was not the only contributing factor to the performance degradation that we first noticed on the 28th. Given this realization, the team again pivoted. 
Instead of looking at Consul from the perspective of the Roblox services that depend on it, the team started looking at Consul internals for clues.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Research Into Contention (10\/30 02:00 \u2013 10\/30 12:00)<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Over the next 10 hours, the engineering team dug deeper into debug logs and operating system-level metrics. This data showed Consul KV writes getting blocked for long periods of time. In other words, \u201ccontention.\u201d The cause of the contention was not immediately obvious, but one theory was that the shift from 64 to 128 CPU core servers early in the outage might have made the problem worse. After looking at the htop data and performance debugging data shown in the screenshots below, the team concluded that it was worth going back to 64-core servers similar to those used before the outage. The team began to prep the hardware: Consul was installed, operating system configurations were triple-checked, and the machines were readied for service in as detailed a manner as possible. The team then transitioned the Consul cluster back to 64 CPU core servers, but this change did not help. 
This was our fourth attempt at diagnosing the root cause of the incident.<\/span><\/p>\n<div id=\"attachment_39957\" style=\"width: 2566px\" class=\"wp-caption aligncenter\"><img decoding=\"async\" aria-describedby=\"caption-attachment-39957\" loading=\"lazy\" class=\"wp-image-39957 size-full\" src=\"https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-6.png\" alt=\"\" width=\"2556\" height=\"1310\" srcset=\"https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-6.png 2556w, https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-14.png 300w, https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-15.png 1024w, https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-16.png 768w, https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-17.png 1536w, https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-18.png 2048w, https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-19.png 1920w\" sizes=\"auto, (max-width: 2556px) 100vw, 2556px\" \/><\/p>\n<p id=\"caption-attachment-39957\" class=\"wp-caption-text\">3. We then displayed this with a perf report as shown above. 
The majority of time was spent in kernel spin locks via the Streaming subscription code path.<\/p>\n<\/div>\n<div id=\"attachment_39967\" style=\"width: 4638px\" class=\"wp-caption aligncenter\"><img decoding=\"async\" aria-describedby=\"caption-attachment-39967\" loading=\"lazy\" class=\"wp-image-39967 size-full\" src=\"https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-7.png\" alt=\"\" width=\"4628\" height=\"1824\" srcset=\"https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-7.png 4628w, https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-20.png 300w, https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-21.png 1024w, https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-22.png 768w, https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-23.png 1536w, https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-24.png 2048w, https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-25.png 1920w\" sizes=\"auto, (max-width: 4628px) 100vw, 4628px\" \/><\/p>\n<p id=\"caption-attachment-39967\" class=\"wp-caption-text\">4. HTOP showing CPU Usage across 128 cores.<\/p>\n<\/div>\n<h2><span style=\"font-weight: 400;\">Root Causes Found (10\/30 12:00 \u2013 10\/30 20:00)<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Several months ago, we enabled a new Consul streaming feature on a subset of our services.\u00a0 This feature, designed to lower the CPU usage and network bandwidth of the Consul cluster, worked as expected, so over the next few months we incrementally enabled the feature on more of our backend services. 
On October 27th at 14:00, one day before the outage, we enabled this feature on a backend service that is responsible for traffic routing. As part of this rollout, in order to prepare for the increased traffic we typically see at the end of the year, we also increased the number of nodes supporting traffic routing by 50%. The system had worked well with streaming at this level for a day before the incident started, so it wasn&#8217;t initially clear why its performance had changed. However, through analysis of perf reports and flame graphs from Consul servers, we saw evidence that streaming code paths were responsible for the contention causing high CPU usage. We disabled the streaming feature for all Consul systems, including the traffic routing nodes. The config change finished propagating at 15:51, at which time the 50th percentile for Consul KV writes lowered to 300ms. We finally had a breakthrough.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Why was streaming an issue? HashiCorp explained that, while streaming was overall more efficient, it used fewer concurrency control elements (Go channels) in its implementation than long polling. Under very high load \u2013 specifically, both a very high read load and a very high write load \u2013 the design of streaming exacerbates the amount of contention on a single Go channel, which causes blocking during writes, making it significantly less efficient. This behavior also explained the effect of higher core-count servers: those servers were dual-socket architectures with a NUMA memory model. The additional contention on shared resources thus got worse under this architecture. By turning off streaming, we dramatically improved the health of the Consul cluster.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Despite the breakthrough, we were not yet out of the woods. 
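HashiCorp's explanation can be illustrated with a toy Go program. This is our sketch of the general pattern, not Consul's actual streaming code: when every update is funneled through one shared channel, producers serialize behind a single consumer and block one another under a heavy write load.

```go
package main

import (
	"fmt"
	"sync"
)

// fanInSingleChannel is a toy model (not Consul source) of streaming's
// bottleneck: many producers pushing updates through ONE shared channel.
// Every send must rendezvous with the single consumer, so under a heavy
// write load producers spend their time blocked on the same channel,
// contending with each other instead of making progress.
func fanInSingleChannel(producers, eventsEach int) int {
	events := make(chan int) // the single shared, unbuffered channel
	var wg sync.WaitGroup
	for p := 0; p < producers; p++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < eventsEach; i++ {
				events <- i // all producers block here, one at a time
			}
		}()
	}
	// Close the channel once every producer has finished.
	go func() { wg.Wait(); close(events) }()

	received := 0
	for range events { // the lone consumer is the throughput ceiling
		received++
	}
	return received
}

func main() {
	fmt.Println(fanInSingleChannel(100, 50)) // 5000
}
```

The long-polling path, by HashiCorp's description, used more Go channels, which spreads this kind of load across independent synchronization points instead of concentrating it on one.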
We saw Consul intermittently electing new cluster leaders, which was normal, but we also saw some leaders exhibiting the same latency problems we saw before we disabled streaming, which was not normal. Without any obvious clues pointing to the root cause of the slow leader problem, and with evidence that the cluster was healthy as long as certain servers were not elected as the leaders, the team made the pragmatic decision to work around the problem by preventing the problematic leaders from staying elected. This enabled the team to focus on returning the Roblox services that rely on Consul to a healthy state.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">But what was going on with the slow leaders? We did not figure this out during the incident, but HashiCorp engineers determined the root cause in the days after the outage. Consul uses a popular open-source persistence library named BoltDB to store Raft logs. It is <\/span><i><span style=\"font-weight: 400;\">not <\/span><\/i><span style=\"font-weight: 400;\">used to store the current state within Consul, but rather a rolling log of the operations being applied. To prevent BoltDB from growing indefinitely, Consul regularly performs snapshots. The snapshot operation writes the current state of Consul to disk and then deletes the oldest log entries from BoltDB.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">However, due to the design of BoltDB, even when the oldest log entries are deleted, the space BoltDB uses on disk never shrinks. Instead, all the pages (4kb segments within the file) that were used to store deleted data are instead marked as &#8220;free&#8221; and re-used for subsequent writes. 
BoltDB tracks these free pages in a structure called its &#8220;freelist.&#8221; Typically, write latency is not meaningfully impacted by the time it takes to update the freelist, but Roblox\u2019s workload <\/span><span style=\"font-weight: 400;\">exposed a pathological performance issue in BoltDB that made freelist maintenance extremely expensive.<\/span><span style=\"font-weight: 400;\">\u00a0<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Restoring Caching Service (10\/30 20:00 \u2013 10\/31 05:00)<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">It had been 54 hours since the start of the outage. With streaming disabled and a process in place to prevent slow leaders from staying elected, Consul was now consistently stable. The team was ready to focus on a return to service.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Roblox uses a typical microservices pattern for its backend. At the bottom of the microservices \u201cstack\u201d are databases and caches. These databases were unaffected by the outage, but the caching system, which regularly handles 1B requests-per-second across its multiple layers during regular system operation, was unhealthy. Since our caches store transient data that can easily repopulate from the underlying databases, the easiest way to bring the caching system back into a healthy state was to redeploy it.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The cache redeployment process ran into a series of issues:\u00a0<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Likely due to the Consul cluster snapshot reset that had been performed earlier on, internal scheduling data that the cache system stores in the Consul KV were incorrect.\u00a0<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Deployments of small caches were taking longer than expected to deploy, and deployments of large caches were not finishing. 
It turned out that one node was unhealthy but appeared to the job scheduler as completely open, so the scheduler aggressively placed cache jobs on it, and those jobs failed.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The caching system\u2019s automated deployment tool was built to support incremental adjustments to large-scale deployments that were already handling traffic at scale, not iterative attempts to bootstrap a large cluster from scratch.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">The team worked through the night to identify and address these issues, ensure cache systems were properly deployed, and verify correctness. At 05:00 on October 31, 61 hours since the start of the outage, we had a healthy Consul cluster and a healthy caching system. We were ready to bring up the rest of Roblox.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">The Return of Players (10\/31 05:00 \u2013 10\/31 16:00)<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">The final return-to-service phase began officially at 05:00 on the 31st. Similar to the caching system, a significant portion of running services had been shut down during the initial outage or the troubleshooting phases. The team needed to restart these services at correct capacity levels and verify that they were functioning correctly. This went smoothly, and by 10:00, we were ready to open up to players.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">With cold caches and a system we were still uncertain about, we did not want a flood of traffic that could potentially put the system back into an unstable state. To avoid a flood, we used DNS steering to manage the number of players who could access Roblox. 
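A percentage-based admission gate of this general kind can be sketched in Go. This is a hypothetical illustration with made-up names; the real decision was made at the DNS layer, not in application code like this:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// admit reports whether a player is let in at the current rollout
// percentage (0-100). Hashing the player ID makes the decision
// deterministic: a player admitted at 10% stays admitted at 20%.
// The function name, ID format, and mechanism are all hypothetical.
func admit(playerID string, percent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(playerID))
	return h.Sum32()%100 < percent
}

func main() {
	fmt.Println(admit("player-12345", 0))   // false: 0% admits nobody
	fmt.Println(admit("player-12345", 100)) // true: 100% admits everyone
}
```

Hashing rather than random sampling keeps each player's admission stable as the percentage ratchets up, so already-admitted players are never bounced back to the maintenance page.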
This allowed us to let in a certain percentage of randomly selected players while others continued to be redirected to our static maintenance page. Every time we increased the percentage, we checked database load, cache performance, and overall system stability. Work continued throughout the day, ratcheting up access in roughly 10% increments. We enjoyed seeing some of our most dedicated players figure out our DNS steering scheme and start exchanging this information on Twitter so that they could get \u201cearly\u201d access as we brought the service back up. At 16:45 Sunday, 73 hours after the start of the outage, 100% of players were given access and Roblox was fully operational.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Further Analysis and Changes Resulting from the Outage<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">While players were allowed to return to Roblox on October 31st, Roblox and HashiCorp continued refining their understanding of the outage throughout the following week. Specific contention issues in the new streaming protocol were identified and isolated. While HashiCorp had <\/span><a href=\"https:\/\/www.hashicorp.com\/cgsb\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">benchmarked streaming<\/span><\/a><span style=\"font-weight: 400;\"> at a scale similar to Roblox usage, they had not observed this specific behavior before because it manifests only under a combination of a large number of streams and a high churn rate. The HashiCorp engineering team is creating new laboratory benchmarks to reproduce the specific contention issue and performing additional scale tests. 
HashiCorp is also working to improve the design of the streaming system to avoid contention under extreme load and ensure stable performance in such conditions.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Further analysis of the slow leader problem also uncovered the key cause of the two-second Raft data writes and cluster consistency issues. Engineers looked at flame graphs like the one below to get a better understanding of the inner workings of BoltDB. <\/span><\/p>\n<div id=\"attachment_39977\" style=\"width: 4472px\" class=\"wp-caption aligncenter\"><img decoding=\"async\" aria-describedby=\"caption-attachment-39977\" loading=\"lazy\" class=\"wp-image-39977 size-full\" src=\"https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-8.png\" alt=\"\" width=\"4462\" height=\"1594\" srcset=\"https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-8.png 4462w, https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-26.png 300w, https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-27.png 1024w, https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-28.png 768w, https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-29.png 1536w, https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-30.png 2048w, https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-31.png 1920w\" sizes=\"auto, (max-width: 4462px) 100vw, 4462px\" \/><\/p>\n<p id=\"caption-attachment-39977\" class=\"wp-caption-text\">5. BoltDB freelist operations analysis.<\/p>\n<\/div>\n<p><span style=\"font-weight: 400;\">As previously mentioned, Consul uses a persistence library called BoltDB to store Raft log data. 
Due to a specific usage pattern created during the incident, 16kB write operations instead became much larger. You can see the problem illustrated in these screenshots:<\/span><\/p>\n<div id=\"attachment_39987\" style=\"width: 630px\" class=\"wp-caption aligncenter\"><img decoding=\"async\" aria-describedby=\"caption-attachment-39987\" loading=\"lazy\" class=\"wp-image-39987\" src=\"https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-9.png\" alt=\"\" width=\"620\" height=\"740\" \/><\/p>\n<p id=\"caption-attachment-39987\" class=\"wp-caption-text\">6. Detailed BoltDB statistics used in analysis.<\/p>\n<\/div>\n<p><span style=\"font-weight: 400;\">The preceding command output tells us a number of things:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">This 4.2GB log store is only storing 489MB of actual data (including all the index internals). <\/span><b>3.8GB is &#8220;empty&#8221; space.<\/b><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The <\/span><b>freelist is 7.8MB <\/b><span style=\"font-weight: 400;\">since it contains nearly a million free page ids. <\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">That means, for every log append (each Raft write after some batching), a new 7.8MB freelist was also being written out to disk even though the actual raw data being appended was 16kB or less.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Back pressure on these operations also created full TCP buffers and contributed to 2-3s write times on unhealthy leaders. The image below shows research into TCP Zero Windows during the incident. 
<\/span><\/p>\n<div id=\"attachment_39997\" style=\"width: 930px\" class=\"wp-caption aligncenter\"><img decoding=\"async\" aria-describedby=\"caption-attachment-39997\" loading=\"lazy\" class=\"size-full wp-image-39997\" src=\"https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-10.png\" alt=\"\" width=\"920\" height=\"378\" srcset=\"https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-10.png 920w, https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-32.png 300w, https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021-33.png 768w\" sizes=\"auto, (max-width: 920px) 100vw, 920px\" \/><\/p>\n<p id=\"caption-attachment-39997\" class=\"wp-caption-text\">7. Research into TCP zero windows. When a TCP receiver&#8217;s buffer begins to fill, it can reduce its receive window. If it fills, it can reduce the window to zero, which tells the TCP sender to stop sending.<\/p>\n<\/div>\n<p><span style=\"font-weight: 400;\">HashiCorp and Roblox have developed and deployed a process using existing BoltDB tooling to \u201ccompact\u201d the database, which resolved the performance issues.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Recent Improvements and Future Steps<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">It has been 2.5 months since the outage. What have we been up to? We used this time to learn as much as we could from the outage, to adjust engineering priorities based on what we learned, and to aggressively harden our systems. 
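Before turning to the individual improvements, a back-of-envelope check makes the BoltDB freelist overhead described earlier concrete. The figures come from the analysis above; the 8-byte page-id size is an assumption based on BoltDB's use of 64-bit page ids, and it is consistent with the "nearly a million free page ids" observation:

```python
# Back-of-envelope check of the BoltDB freelist overhead.
FREELIST_BYTES = 7.8 * 1024 * 1024   # freelist rewritten on every commit
PAGE_ID_BYTES = 8                    # assumed size of one BoltDB page id
APPEND_BYTES = 16 * 1024             # typical batched Raft append

free_page_ids = FREELIST_BYTES / PAGE_ID_BYTES
amplification = FREELIST_BYTES / APPEND_BYTES

print(f"~{free_page_ids / 1e6:.1f}M free page ids tracked")
print(f"~{amplification:.0f}x write overhead per 16kB append")
```

Rewriting a ~7.8MB freelist for every ~16kB append amounts to roughly 500x write overhead per Raft commit, which is why compacting the database (shrinking the freelist) resolved the performance issue.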
One of our Roblox values is Respect The Community, and while we could have issued a post sooner to explain what happened, we felt we owed it to you, our community, to make significant progress on improving the reliability of our systems before publishing.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The full list of completed and in-flight reliability improvements is too long and too detailed for this write-up, but here are the key items:<\/span><\/p>\n<p><b>Telemetry Improvements<\/b><\/p>\n<p><span style=\"font-weight: 400;\">There was a circular dependency between our telemetry systems and Consul, which meant that when Consul was unhealthy, we lacked the telemetry data that would have made it easier for us to figure out what was wrong. We have removed this circular dependency. Our telemetry systems no longer depend on the systems that they are configured to monitor.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We have extended our telemetry systems to provide better visibility into Consul and BoltDB performance. We now receive highly targeted alerts if there are any signs that the system is approaching the state that caused this outage. We have also extended our telemetry systems to provide more visibility into the traffic patterns between Roblox services and Consul. This additional visibility into the behavior and performance of our system at multiple levels has already helped us during system upgrades and debugging sessions.<\/span><\/p>\n<p><b>Expansion Into Multiple Availability Zones and Data Centers<\/b><\/p>\n<p><span style=\"font-weight: 400;\">Running all Roblox backend services on one Consul cluster left us exposed to an outage of this nature. We have already built out the servers and networking for an additional, geographically distinct data center that will host our backend services. 
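Returning for a moment to the telemetry circular dependency described above: one lightweight guard against reintroducing such a cycle is a dependency-graph check that fails the build if the monitoring system can reach, directly or transitively, anything it is supposed to monitor. This is an illustrative sketch with hypothetical service names, not Roblox's actual tooling:

```python
# Hypothetical service dependency graph; edges point at dependencies.
deps = {
    "telemetry": ["object-storage"],
    "object-storage": [],
    "game-services": ["consul", "cache"],
    "cache": ["consul", "nomad"],
    "consul": [],
    "nomad": ["consul"],
}

def reachable(graph, start):
    """All services reachable from `start` via dependency edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

monitored = {"consul", "nomad", "cache", "game-services"}
assert not (reachable(deps, "telemetry") & monitored), \
    "telemetry must not depend on anything it monitors"
```

Run as part of CI, a check like this turns the "telemetry depends on Consul" failure mode from a painful mid-incident discovery into a build-time error.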
We have efforts underway to move to multiple availability zones within these data centers; we have made major modifications to our engineering roadmap and our staffing plans in order to accelerate these efforts.<\/span><\/p>\n<p><b>Consul Upgrades and Sharding<\/b><\/p>\n<p><span style=\"font-weight: 400;\">Roblox is still growing quickly, so even with multiple Consul clusters, we want to reduce the load we place on Consul. We have reviewed how our services use Consul\u2019s KV store and health checks, and have split some critical services into their own dedicated clusters, reducing load on our central Consul cluster to a safer level.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Some core Roblox services are using Consul\u2019s KV store directly as a convenient place to store data, even though we have other storage systems that are likely more appropriate. We are in the process of migrating this data to a more appropriate storage system. Once complete, this will also reduce load on Consul.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We discovered a large amount of obsolete KV data. Deleting this obsolete data improved Consul performance.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We are working closely with HashiCorp to deploy a new version of Consul that replaces BoltDB with a successor called <\/span><a href=\"https:\/\/github.com\/etcd-io\/bbolt\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">bbolt<\/span><\/a><span style=\"font-weight: 400;\"> that does not have the same issue with unbounded freelist growth. We intentionally postponed this effort into the new year to avoid a complex upgrade during our peak end-of-year traffic. 
The upgrade is being tested now and will complete in Q1.<\/span><\/p>\n<p><b>Improvements To Bootstrapping Procedures and Config Management<\/b><\/p>\n<p><span style=\"font-weight: 400;\">The return to service effort was slowed by a number of factors, including the deployment and warming of caches needed by Roblox services. We are developing new tools and processes to make this process more automated and less error-prone.\u00a0 In particular, we have redesigned our cache deployment mechanisms to ensure we can quickly bring up our cache system from a standing start. Implementation of this is underway.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We worked with HashiCorp to identify several Nomad enhancements that will make it easier for us to turn up large jobs after a long period of unavailability. These enhancements will be deployed as part of our next Nomad upgrade, scheduled for later this month.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We have developed and deployed mechanisms for faster machine configuration changes.<\/span><\/p>\n<p><b>Reintroduction of Streaming<\/b><\/p>\n<p><span style=\"font-weight: 400;\">We originally deployed streaming to lower the CPU usage and network bandwidth of the Consul cluster. Once a new implementation has been tested at our scale with our workload, we expect to carefully reintroduce it to our systems.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">A Note on Public Cloud<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">In the aftermath of an outage like this, it\u2019s natural to ask if Roblox would consider moving to public cloud and letting a third party manage our foundational compute, storage, and networking services.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Another one of our Roblox values is Take The Long View, and this value heavily informs our decision-making. 
We build and manage our own foundational infrastructure on-prem because, at our current scale, and more importantly, the scale that we know we\u2019ll reach as our platform grows, we believe it\u2019s the best way to support our business and our community. Specifically, by building and managing our own data centers for backend and network edge services, we have been able to significantly control costs compared to public cloud. These savings directly influence the amount we are able to pay to creators on the platform. Furthermore, owning our own hardware and building our own edge infrastructure allows us to minimize performance variations and carefully manage the latency of our players around the world. Consistent performance and low latency are critical to the experience of our players, who are not necessarily located near the data centers of public cloud providers.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Note that we are not ideologically wedded to any particular approach: we use public cloud for\u00a0 use cases where it makes the most sense for our players and developers. As examples, we use public cloud for burst capacity, large portions of our DevOps workflows, and most of our in-house analytics. In general we find public cloud to be a good tool for applications that are not performance and latency critical, and that run at a limited scale. However, for our most performance and latency critical workloads, we have made the choice to build and manage our own infrastructure on-prem. We made this choice knowing that it takes time, money, and talent, but also knowing that it will allow us to build a better platform. This is consistent with our Take The Long View value.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">System Stability Since The Outage<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Roblox typically receives a surge of traffic at the end of December. 
We have a lot more reliability work to do, but we are pleased to report that Roblox did not have a single significant production incident during the December surge, and that the performance and stability of both Consul and Nomad during this surge were excellent. It appears that our immediate reliability improvements are already paying off, and as our longer term projects wrap up we expect even better results.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Closing Thoughts<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">We want to thank our global Roblox community for their understanding and support.\u00a0 Another one of our Roblox values is Take Responsibility, and we take full responsibility for what happened here. We would like to once again extend our heartfelt thanks to the team at HashiCorp. Their engineers jumped in to assist us at the outset of this unprecedented outage and did not leave our side. Even now, with the outage two months behind us, Roblox and HashiCorp engineers continue to collaborate closely to ensure we\u2019re collectively doing everything we can to prevent a similar outage from ever happening again.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Finally, we want to thank our Roblox colleagues for demonstrating why this is an amazing place to work. At Roblox we believe in civility and respect. It\u2019s easy to be civil and respectful when things are going well, but the real test is how we treat one another when things get difficult. At some point during a 73-hour outage, with the clock ticking and the stress building, it wouldn\u2019t be surprising to see someone lose their cool, say something disrespectful, or wonder aloud whose fault this all was. But that\u2019s not what happened. We supported one another, and we worked together as one team around the clock until the service was healthy. 
We are, of course, not proud of this outage and the impact it had on our community, but we <\/span><b><i>are<\/i><\/b><span style=\"font-weight: 400;\"> proud of how we came together as a team to bring Roblox back to life, and how we treated each other with civility and respect at every step along the way.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We have learned tremendously from this experience, and we are more committed than ever to make Roblox a stronger and more reliable platform going forward.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Thank you again.\u00a0<\/span><\/p>\n<hr \/>\n<p><span style=\"font-weight: 400;\"> \u00b9 <\/span><span style=\"font-weight: 400;\">Note all dates and time in this blog post are in Pacific Standard Time (PST).<\/span><\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/blog.roblox.com\/2022\/01\/roblox-return-to-service-10-28-10-31-2021\/\">Roblox Return to Service 10\/28-10\/31 2021<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/blog.roblox.com\">Roblox Blog<\/a>.<\/p>\n<p> <a href=\"https:\/\/blog.roblox.com\/2022\/01\/roblox-return-to-service-10-28-10-31-2021\/\">Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Starting October 28th and fully resolving on October 31st, Roblox experienced a 73-hour outage.\u00b9 Fifty million players regularly use Roblox every day and, to create the experience our players expect, our scale involves hundreds of internal online services. As with any large-scale service, we have service interruptions from time to time, but the extended length of this outage makes it particularly noteworthy. We sincerely apologize to our community for the downtime. We\u2019re sharing these technical details to give our community an understanding of the root cause of the problem, how we addressed it, and what we are doing to prevent similar issues from happening in the future. 
We would like to reiterate there was no user data loss or access by unauthorized parties of any&hellip;<\/p>\n<p class=\"excerpt-more\"><a class=\"blog-excerpt button\" href=\"https:\/\/arcader.org\/news\/roblox-return-to-service-10-28-10-31-2021\/\">Read More&#8230;<\/a><\/p>\n","protected":false},"author":1,"featured_media":487791,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[298],"tags":[299],"class_list":["post-487790","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-roblox","tag-product-tech"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.4 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Roblox Return to Service 10\/28-10\/31 2021 | Arcader News<\/title>\n<meta name=\"description\" content=\"Starting October 28th and fully resolving on October 31st, Roblox experienced a 73-hour outage.\u00b9 Fifty million players regularly use Roblox every day and,\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/arcader.org\/news\/roblox-return-to-service-10-28-10-31-2021\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Roblox Return to Service 10\/28-10\/31 2021 | Arcader News\" \/>\n<meta property=\"og:description\" content=\"Starting October 28th and fully resolving on October 31st, Roblox experienced a 73-hour outage.\u00b9 Fifty million players regularly use Roblox every day and,\" \/>\n<meta property=\"og:url\" content=\"https:\/\/arcader.org\/news\/roblox-return-to-service-10-28-10-31-2021\/\" \/>\n<meta property=\"og:site_name\" content=\"Arcade News\" \/>\n<meta property=\"article:published_time\" content=\"2025-11-13T04:27:11+00:00\" \/>\n<meta property=\"og:image\" 
content=\"https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"480\" \/>\n\t<meta property=\"og:image:height\" content=\"322\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Arcade News\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Arcade News\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"26 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/arcader.org\\\/news\\\/roblox-return-to-service-10-28-10-31-2021\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/arcader.org\\\/news\\\/roblox-return-to-service-10-28-10-31-2021\\\/\"},\"author\":{\"name\":\"Arcade News\",\"@id\":\"https:\\\/\\\/arcader.org\\\/news\\\/#\\\/schema\\\/person\\\/8460f5e5076b52fb2369f2f7ce6f2839\"},\"headline\":\"Roblox Return to Service 10\\\/28-10\\\/31 2021\",\"datePublished\":\"2025-11-13T04:27:11+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/arcader.org\\\/news\\\/roblox-return-to-service-10-28-10-31-2021\\\/\"},\"wordCount\":5157,\"commentCount\":0,\"image\":{\"@id\":\"https:\\\/\\\/arcader.org\\\/news\\\/roblox-return-to-service-10-28-10-31-2021\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/arcader.org\\\/wp-content\\\/uploads\\\/2022\\\/01\\\/roblox-return-to-service-10-28-10-31-2021.jpg\",\"keywords\":[\"Product &amp; 
Tech\"],\"articleSection\":[\"Roblox\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/arcader.org\\\/news\\\/roblox-return-to-service-10-28-10-31-2021\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/arcader.org\\\/news\\\/roblox-return-to-service-10-28-10-31-2021\\\/\",\"url\":\"https:\\\/\\\/arcader.org\\\/news\\\/roblox-return-to-service-10-28-10-31-2021\\\/\",\"name\":\"Roblox Return to Service 10\\\/28-10\\\/31 2021 | Arcader News\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/arcader.org\\\/news\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/arcader.org\\\/news\\\/roblox-return-to-service-10-28-10-31-2021\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/arcader.org\\\/news\\\/roblox-return-to-service-10-28-10-31-2021\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/arcader.org\\\/wp-content\\\/uploads\\\/2022\\\/01\\\/roblox-return-to-service-10-28-10-31-2021.jpg\",\"datePublished\":\"2025-11-13T04:27:11+00:00\",\"author\":{\"@id\":\"https:\\\/\\\/arcader.org\\\/news\\\/#\\\/schema\\\/person\\\/8460f5e5076b52fb2369f2f7ce6f2839\"},\"description\":\"Starting October 28th and fully resolving on October 31st, Roblox experienced a 73-hour outage.\u00b9 Fifty million players regularly use Roblox every day 
and,\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/arcader.org\\\/news\\\/roblox-return-to-service-10-28-10-31-2021\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/arcader.org\\\/news\\\/roblox-return-to-service-10-28-10-31-2021\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/arcader.org\\\/news\\\/roblox-return-to-service-10-28-10-31-2021\\\/#primaryimage\",\"url\":\"https:\\\/\\\/arcader.org\\\/wp-content\\\/uploads\\\/2022\\\/01\\\/roblox-return-to-service-10-28-10-31-2021.jpg\",\"contentUrl\":\"https:\\\/\\\/arcader.org\\\/wp-content\\\/uploads\\\/2022\\\/01\\\/roblox-return-to-service-10-28-10-31-2021.jpg\",\"width\":480,\"height\":322,\"caption\":\"Roblox Return to Service 10\\\/28-10\\\/31 2021\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/arcader.org\\\/news\\\/roblox-return-to-service-10-28-10-31-2021\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/arcader.org\\\/news\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Roblox Return to Service 10\\\/28-10\\\/31 2021\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/arcader.org\\\/news\\\/#website\",\"url\":\"https:\\\/\\\/arcader.org\\\/news\\\/\",\"name\":\"Arcade News\",\"description\":\"Free Arcade News from the Best Online Sources\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/arcader.org\\\/news\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/arcader.org\\\/news\\\/#\\\/schema\\\/person\\\/8460f5e5076b52fb2369f2f7ce6f2839\",\"name\":\"Arcade 
News\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/3fea48a614d86edd987bc7bb25f4707c69546d4b1f78ad4aa20b26316bad1f9d?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/3fea48a614d86edd987bc7bb25f4707c69546d4b1f78ad4aa20b26316bad1f9d?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/3fea48a614d86edd987bc7bb25f4707c69546d4b1f78ad4aa20b26316bad1f9d?s=96&d=mm&r=g\",\"caption\":\"Arcade News\"},\"sameAs\":[\"https:\\\/\\\/cricketgames.tv\"],\"url\":\"https:\\\/\\\/arcader.org\\\/news\\\/author\\\/arcade-news\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Roblox Return to Service 10\/28-10\/31 2021 | Arcader News","description":"Starting October 28th and fully resolving on October 31st, Roblox experienced a 73-hour outage.\u00b9 Fifty million players regularly use Roblox every day and,","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/arcader.org\/news\/roblox-return-to-service-10-28-10-31-2021\/","og_locale":"en_US","og_type":"article","og_title":"Roblox Return to Service 10\/28-10\/31 2021 | Arcader News","og_description":"Starting October 28th and fully resolving on October 31st, Roblox experienced a 73-hour outage.\u00b9 Fifty million players regularly use Roblox every day and,","og_url":"https:\/\/arcader.org\/news\/roblox-return-to-service-10-28-10-31-2021\/","og_site_name":"Arcade News","article_published_time":"2025-11-13T04:27:11+00:00","og_image":[{"width":480,"height":322,"url":"https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021.jpg","type":"image\/jpeg"}],"author":"Arcade News","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Arcade News","Est. 
reading time":"26 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/arcader.org\/news\/roblox-return-to-service-10-28-10-31-2021\/#article","isPartOf":{"@id":"https:\/\/arcader.org\/news\/roblox-return-to-service-10-28-10-31-2021\/"},"author":{"name":"Arcade News","@id":"https:\/\/arcader.org\/news\/#\/schema\/person\/8460f5e5076b52fb2369f2f7ce6f2839"},"headline":"Roblox Return to Service 10\/28-10\/31 2021","datePublished":"2025-11-13T04:27:11+00:00","mainEntityOfPage":{"@id":"https:\/\/arcader.org\/news\/roblox-return-to-service-10-28-10-31-2021\/"},"wordCount":5157,"commentCount":0,"image":{"@id":"https:\/\/arcader.org\/news\/roblox-return-to-service-10-28-10-31-2021\/#primaryimage"},"thumbnailUrl":"https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021.jpg","keywords":["Product &amp; Tech"],"articleSection":["Roblox"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/arcader.org\/news\/roblox-return-to-service-10-28-10-31-2021\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/arcader.org\/news\/roblox-return-to-service-10-28-10-31-2021\/","url":"https:\/\/arcader.org\/news\/roblox-return-to-service-10-28-10-31-2021\/","name":"Roblox Return to Service 10\/28-10\/31 2021 | Arcader News","isPartOf":{"@id":"https:\/\/arcader.org\/news\/#website"},"primaryImageOfPage":{"@id":"https:\/\/arcader.org\/news\/roblox-return-to-service-10-28-10-31-2021\/#primaryimage"},"image":{"@id":"https:\/\/arcader.org\/news\/roblox-return-to-service-10-28-10-31-2021\/#primaryimage"},"thumbnailUrl":"https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021.jpg","datePublished":"2025-11-13T04:27:11+00:00","author":{"@id":"https:\/\/arcader.org\/news\/#\/schema\/person\/8460f5e5076b52fb2369f2f7ce6f2839"},"description":"Starting October 28th and fully resolving on October 31st, Roblox experienced a 73-hour 
outage.\u00b9 Fifty million players regularly use Roblox every day and,","breadcrumb":{"@id":"https:\/\/arcader.org\/news\/roblox-return-to-service-10-28-10-31-2021\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/arcader.org\/news\/roblox-return-to-service-10-28-10-31-2021\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/arcader.org\/news\/roblox-return-to-service-10-28-10-31-2021\/#primaryimage","url":"https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021.jpg","contentUrl":"https:\/\/arcader.org\/wp-content\/uploads\/2022\/01\/roblox-return-to-service-10-28-10-31-2021.jpg","width":480,"height":322,"caption":"Roblox Return to Service 10\/28-10\/31 2021"},{"@type":"BreadcrumbList","@id":"https:\/\/arcader.org\/news\/roblox-return-to-service-10-28-10-31-2021\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/arcader.org\/news\/"},{"@type":"ListItem","position":2,"name":"Roblox Return to Service 10\/28-10\/31 2021"}]},{"@type":"WebSite","@id":"https:\/\/arcader.org\/news\/#website","url":"https:\/\/arcader.org\/news\/","name":"Arcade News","description":"Free Arcade News from the Best Online Sources","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/arcader.org\/news\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/arcader.org\/news\/#\/schema\/person\/8460f5e5076b52fb2369f2f7ce6f2839","name":"Arcade 
News","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/3fea48a614d86edd987bc7bb25f4707c69546d4b1f78ad4aa20b26316bad1f9d?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/3fea48a614d86edd987bc7bb25f4707c69546d4b1f78ad4aa20b26316bad1f9d?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/3fea48a614d86edd987bc7bb25f4707c69546d4b1f78ad4aa20b26316bad1f9d?s=96&d=mm&r=g","caption":"Arcade News"},"sameAs":["https:\/\/cricketgames.tv"],"url":"https:\/\/arcader.org\/news\/author\/arcade-news\/"}]}},"_links":{"self":[{"href":"https:\/\/arcader.org\/news\/wp-json\/wp\/v2\/posts\/487790","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/arcader.org\/news\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/arcader.org\/news\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/arcader.org\/news\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/arcader.org\/news\/wp-json\/wp\/v2\/comments?post=487790"}],"version-history":[{"count":1,"href":"https:\/\/arcader.org\/news\/wp-json\/wp\/v2\/posts\/487790\/revisions"}],"predecessor-version":[{"id":1390515,"href":"https:\/\/arcader.org\/news\/wp-json\/wp\/v2\/posts\/487790\/revisions\/1390515"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/arcader.org\/news\/wp-json\/wp\/v2\/media\/487791"}],"wp:attachment":[{"href":"https:\/\/arcader.org\/news\/wp-json\/wp\/v2\/media?parent=487790"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/arcader.org\/news\/wp-json\/wp\/v2\/categories?post=487790"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/arcader.org\/news\/wp-json\/wp\/v2\/tags?post=487790"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}