LINX LON1 Outage – March 2021
23rd March 2021
Between 11:15 and 11:28, and again between 11:41 and 11:57, there was a degradation of service on our LON1 exchange affecting a small number of members.
Our technology support team began investigating these incidents in concert with our switch vendor’s Technical Assistance Centre (TAC). Due to the very limited impact, these incidents were not deemed a major incident at that time.
At the TAC’s suggestion, we activated some debug options on affected hardware to assist with diagnosis. Shortly afterwards, at 15:32, we experienced a significant loss of traffic. This was quickly mitigated, and the debug information collected is being used by the vendor TAC for diagnosis. At the time of writing LON1 is operating stably.
Our second London IXP, LON2, is not affected by this incident, nor are any other LINX-operated IXPs in other locations.
Investigation of this incident continues in conjunction with the vendor.
Non-standard procedures currently in operation:
1. Our incident management process has been invoked and we are treating this as a major incident.
2. We are providing members with detailed technical updates hourly via ops-announce.
3. We have suspended provisioning of new connections to LON1 until this incident is closed.
Following the incident yesterday, our LON1 IXP in London continues to perform stably. Our switch vendor is continuing root cause analysis. While this continues, our suspension of provisioning for new connections on LON1 remains in place.
All other LINX IXPs are unaffected by yesterday’s incident and are operating under normal procedures.
24th March 2021: 10:30
We are setting a maintenance window on our LON1 IXP in London. The purpose of this maintenance is to perform intrusive testing of one switch in an attempt to identify the cause of yesterday’s original incident. There are ten LINX members connected to this switch, and we are contacting each of them individually so that they may prepare for any impact and manage traffic flows gracefully. We do not expect this maintenance to impair the operation of LON1 as a whole, but any such intervention does come with some additional risk.
The planned maintenance window is 11:00-11:30, but the actual time required for testing may be much shorter, depending on results.
Outage: Equinix LD8 Reachability – 18 August 2020
At approximately 04:30 BST, Equinix lost power to their LD8 datacentre, which also hosts one of the eleven points of presence for the LINX LON1 and LON2 peering LANs. As a result, all LINX members connected to those LANs at LD8 will have lost connection to LON1 and/or LON2 when their own equipment lost power. At the same time, we lost both our A and B power feeds, and subsequently our equipment within LD8.
We anticipate that approximately 150 LINX members will have been directly affected by this incident. Additionally, LINX members located in other facilities may have lost or impaired interconnection with those members who are directly affected.
Once power to the building is restored we will start bringing equipment back online.
11:46: Power is now being restored to the site and LINX equipment is in the process of being brought live again.
13:42: All LINX devices have been restored. They remain at risk while Equinix continue to work on power in the LINX suite and other areas of the data centre.
Equinix have confirmed that complete and stable power was restored to the whole site by 21:49 on August 18th.
LINX has continued to work through the night with our switch vendor’s Technical Assistance Centre as they investigate detailed logging collected in yesterday’s intervention on one switch connected to the LON1 IXP in London. The team has now successfully reproduced the issue under laboratory conditions. Root cause analysis continues.
While this continues, our suspension of provisioning of new connections to LON1 remains in place.
Detailed technical updates are being shared with members through the normal e-mail channel, ops-announce.
LON2, our other London IXP, is not affected by the issue, nor are LINX-operated IXPs in other cities.
25th March 2021: 14:10
Further to the incident we experienced on 23rd March on our LON1 IXP in London, the IXP continues to perform stably. Our switch vendor’s Technical Assistance Centre is now able to consistently reproduce the issue in their simulated recreation of our network, and to consistently resolve it.
Our focus has moved from diagnosis to remediation: we are now building a plan to remove the newly identified source of risk from the switches on LON1 so that we may resume normal working procedures.
25th March 2021: 17:45
Working with our switch vendor’s Technical Assistance Centre we have now completed implementing and testing the recommended fix to resolve the issue that caused the incident on 23rd March. Our LON1 IXP in London is no longer considered ‘at risk’, and we are calling a close to this as a major incident.
We would like to thank all members for their patience and understanding as we worked to resolve this issue, and especially the ten members who supported us during the maintenance at 11:00 on 24th March. To those members who shut down peering sessions gracefully to steer traffic away from LON1 during the incident: you are welcome to restart your sessions when possible. We also thank those members who used the LON2 network to maintain overall service.
We will publish detailed technical post incident analysis to members through the usual channels in due course.
We are now able to accept provisioning of new connections on our LON1 IXP.