On July 11, 2019, the GitLab service experienced a significant outage. The service was unavailable between 7:15 AM and 12:20 PM.
What led to the outage:
An upgrade to GitLab 11.11.5 was scheduled. As part of the upgrade tasks, we apply OS-level patches before beginning the application upgrade process. Some OS-level patches require a restart, and that was the case this time. After the patches were installed and reported as having completed successfully, the server was restarted. After the restart, the server never came back up fully, and on-screen messages indicated a drive failure.
NOTE: The drive failure did not affect the storage location where the GitLab/Git repositories are stored!
How we recovered:
As part of our recovery procedure, we created a new Virtual Machine, applied all relevant configuration and re-attached the storage where all GitLab/Git repositories are hosted.
We used the nightly backup of the GitLab database and supporting services to recover the additional configuration needed to bring the service back up.
What you should look out for:
Our backup process runs at midnight. Therefore, any data entered through the GitLab user interface (issues, wikis, etc.) between midnight and 7:15 AM may have been lost. This does NOT apply to code committed to the remote repositories. Any code that has been committed to GitLab should not have been affected and no data should have been lost.
Since we are now operating on a new virtual server, those of you who use SSH to interact with the server will have to update your "known_hosts" file. In many cases, you will be prompted with a message along the lines of "Remote Host Identification Has Changed", with some additional text regarding a possible man-in-the-middle attack. This is an expected side effect of the migration to a new host; please do not be alarmed.
To resolve this issue, refer to the message on your screen and take note of which "known_hosts" file it names. These files are typically located in the hidden ".ssh" folder inside your home folder, but they can be stored elsewhere. Once you know where the known_hosts (or known_hosts2 on some Macs) file is located, open it with a text editor, find any line that contains "gitlab.partners.org" or "gitlab.dipr.partners.org", remove the whole line, and save the file.
NOTE: By modifying this file you do NOT risk damaging any configurations you may have for connections to other servers. This file simply contains the public host keys of servers you have connected to previously.
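For those who would rather script the cleanup than edit the file by hand, the following is a minimal Python sketch of the same steps. It is an illustration, not an official tool: the function names and the ".bak" backup are our own assumptions, and it only matches host names stored in plain text. If your SSH client stores hashed host names, the lines will not contain the host name and this sketch will not find them; in that case running "ssh-keygen -R gitlab.partners.org" (and again for the other host name) is the usual alternative.

```python
from pathlib import Path

# Host names taken from the notice above.
STALE_HOSTS = ("gitlab.partners.org", "gitlab.dipr.partners.org")


def drop_stale_entries(text, hosts=STALE_HOSTS):
    """Return known_hosts text without any line that mentions one of the hosts."""
    kept = [
        line
        for line in text.splitlines(keepends=True)
        if not any(host in line for host in hosts)
    ]
    return "".join(kept)


def clean_known_hosts(path):
    """Rewrite a known_hosts file in place after saving a .bak copy next to it.

    The path is whatever file your SSH client named in its warning,
    for example "~/.ssh/known_hosts".
    """
    path = Path(path).expanduser()
    original = path.read_text()
    # Hypothetical safety net: keep the untouched file as <name>.bak.
    path.with_name(path.name + ".bak").write_text(original)
    path.write_text(drop_stale_entries(original))
```

A typical call would be clean_known_hosts("~/.ssh/known_hosts"). Entries for all other servers are left exactly as they were, and the next SSH connection to GitLab will ask you to accept the new host key.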
Please check any automated processes and runners to ensure that they are working as expected. If you experience any issues, please feel free to contact us for assistance at rcc@partners.org with the word GitLab somewhere in the subject line.
---------------------
Additional Notes:
Q: Why did it take so long?
A: The outage was prolonged for a number of reasons. Chief among them was that we had not previously performed a full-scale disaster recovery of the service and had to take careful steps to ensure that no data was lost or corrupted.
Q: Are there any up-sides to this crash?
A: Actually, yes. We are now running on a newer operating system and have more resources provisioned to the machine. We can also now use ED25519 keys for SSH communication with GitLab (previously only RSA keys were supported).
Q: Will the upgrade to GitLab 11.11.5 take place soon?
A: Yes. The upgrade to 11.11.5 will be rescheduled. A separate notification will be sent out once we are sure that the new system has been fully validated by you, the users.