Couchsurfing.com nearly went from 70,000 users to out of business in one fell swoop due to an (inadvertent) lack of backups. Only thanks to the giving nature of their user base, who said, “We refuse to let you die!”, did they not disappear in a flash.
Below is a copy of the details of the failure.
A technical explanation of the crash by Casey Fenton
On June 27th at around 10pm, we were working on databases on the core CS database server. The main database disappeared and MySQL generated an error message that seemed to be related to the file system. Immediately before this we had issued a drop table command on a small test database, and it seemed like both the test and real databases disappeared right at this time. I immediately remotely accessed the overseas-based server, unmounted the partition in question, stopped the MySQL server, and started to investigate. It was not immediately clear what was going on, so we sent an emergency trouble message to the professional system administrators we had hired to manage the CouchSurfing servers. They soon logged into the core database server and attempted to find out what the problem was. They got in touch with the data center where the servers are co-located. When the data center went to check out the server they discovered that one of the four RAID-10 drives was degraded. A few weeks prior the data center had alerted us that another of the drives had alarmed and was degraded, so they removed the drive, but, as we discovered later, they had forgotten to replace it. The RAID was now operating on three of four drives. The techs at the data center stopped the server, rebooted it with a Helix Data Recovery CD, and were somehow able to low-level dump partitions from the RAID array to an external 260 GB drive. They then removed the faulty drive, fixed the RAID array, and restarted the server. They tried to recover data from the image but didn’t have any luck.
Up to that point, all of this data recovery took about 36 long, nail-biting hours. While the techs at the data center were trying to recover the drives, we looked into other options for restoration. At first we assumed that we would be able to recover from a backup about 24 hours old, but we soon discovered that the backup server only had saved backups about a week old. Since that older backup was all that was available, we decided to use it and start the restoration process. It was only when we instructed our hired system administrators to begin copying the data off of the backup server that we discovered that at least 15 of about 100 tables were not in the backup set. Upon more investigation we discovered that the system administrators we employed had switched backup methods a few weeks prior. Since the database server had been running slow for many hours a day and members were complaining, we had asked the system administrators to do whatever they could to lessen the backup load on the server. They responded that they had changed the backup method so that there should now be minimal load on the server. We were on the road, traveling to Montreal at this time, and did not have time to double-check the validity of the work they had assured us they were doing correctly. During the crash we figured out that the backup method that they had switched to was rsync (file synchronization). They were copying the raw database data files to the remote backup server, which is a highly inferior way of backing up a database. Additionally, they had misconfigured rsync in one important way.
The CouchSurfing MySQL database included two types of data tables. One type is called MyISAM and is for larger pieces of data that don’t need to be accessed at high rates. MyISAM tables are smaller. Most of the 100 tables were of this type and were being backed up by rsync. The other table type is called InnoDB. This table type takes twice as much storage space as MyISAM, but has the advantage that the data can be accessed by many server processes at once. The two table types were stored in different locations. The system administrators had been rsync-ing the MyISAM tables but not the InnoDB tables! The 15 InnoDB tables stored the most accessed information on the CouchSurfing website, including profiles, friend links, references, etc.
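The gap described above (one whole table type silently absent from the backup set) is the kind of thing a simple coverage check can surface. What follows is only a minimal sketch, not anything CouchSurfing actually ran: the function name `missing_by_engine` and the table names are hypothetical, and it assumes you can obtain the list of live tables with their storage engines and the list of table names present in the backup by some other means.

```python
from collections import defaultdict


def missing_by_engine(live_tables, backed_up):
    """Group live tables that are absent from the backup by storage engine.

    live_tables: dict mapping table name -> engine name (e.g. "InnoDB")
    backed_up:   set of table names found in the backup set
    """
    missing = defaultdict(list)
    for name, engine in sorted(live_tables.items()):
        if name not in backed_up:
            missing[engine].append(name)
    return dict(missing)


# Hypothetical table names, for illustration only.
live = {
    "profiles": "InnoDB",
    "friend_links": "InnoDB",
    "page_hits": "MyISAM",
}
backup = {"page_hits"}

report = missing_by_engine(live, backup)
for engine, names in report.items():
    print(f"{engine}: {len(names)} table(s) not in backup: {', '.join(names)}")
```

Grouping the result by engine makes a pattern like "every InnoDB table is missing" visible at a glance, which a flat list of names would not.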
The loss of these tables, with no recent backup, signaled the end of the CouchSurfing.com website as we knew it. There was no uniform recent backup. The most recent backup we had was taken just a few days before the crash, but it was a copy of the backup server’s files. These were the incomplete backup files that were a week old. While taking inventory and seeing what other backup information we had, we discovered a mixture, with most of the important files being a couple of months old or more.
Thirty-six hours after the crash, the data center informed us that they had tried to recover the data but there was nothing that could be done. The data was not recoverable. With no recoverable data and no recent backups for the most important CS database tables, the website seemed to be irreparable. This was when I decided to make the announcement that CouchSurfing.com was finished. I wrote the letter explaining what happened to the community and posted it on the website, approximately 48 hours after the crash happened.
What happened next was unbelievable. Within the following 24 hours we received more than 2000 emails of support from members expressing that they simply could not accept the demise of CouchSurfing, that they wanted to help bring it back, and that they would have no problem re-entering their profile information. Many users expressed that they didn’t mind if the databases were zeroed out and the community completely started from scratch. I was reminded that the CS community is not about the data, or about the furniture; it is about the network and the friendships that have already been created. The data was dead, but the community was alive.
On Friday, June 30th, I left the Montreal Collective to remove myself from the intensity and take some time to reflect upon the recent events. A good night’s sleep and some of Aldo’s coffee revived me and I began to read many of the 2000+ emails that came pouring into CouchSurfing. It was clear that CS could not die. The community would do whatever it took to carry on. At about that same time in the afternoon the data center contacted us and indicated that they were trying to recover the data again. Apparently they had seen the letter I sent and wanted to do whatever it took to make sure that CouchSurfing.com didn’t die. They assured us that they were working with data forensics experts to maximize the chances of recovering the data. As of the time of this writing, they report that they are still attempting to recover the data. We should know in a week if this is possible.
That evening, with the support of the community, I started to develop a plan. We decided that it would be worth it to continue to develop CouchSurfing.com if the community would be willing to participate in an even deeper way and take on the majority of the workload. It was apparent that I just couldn’t do all of the work myself. The plan was to gather as much data as we could and re-launch the site as soon as possible. The rest is described in this section of the website.
So, what exactly did we lose?
Only database data was lost. No website code was damaged. The framework of the site, the functions and features, are intact. We had about 90,000 profiles in the database before the crash. All those usernames and passwords were recovered from the chat server. The chat server keeps usernames and passwords mirrored from the core database server, which are synchronized at all times. Working for many days, I was able to stitch together data from earlier back-ups from various sources and recreate an older version of the database. Approximately 11,000 profiles were lost, most of which were empty profiles. Just before the relaunch, I discovered more than 25,000 user profiles in three of the front-end web server cache directories. This was great news! These 25,000 were all of the profiles (the most active profiles on the site) that were accessed on the website in the final 12 hours leading up to the crash. I was able to reverse engineer the cache and insert some of the lost profiles back into the user profile table, thus re-creating many of the most active (and recently added) CouchSurfing.com user profiles. These profiles should be very up-to-date, up to just moments before the crash happened. There might be some discrepancies, but for the most part, they are the most intact.
We were able to recreate empty “place holders” for those people whose profiles didn’t survive the crash. When you log in you will receive a message explaining that your profile was lost, but your username and password were recovered and your place held in the database.
Unfortunately, some other data was not so lucky. We lost up to several months of references, emails, group posts, and friend links. We’ve done our best to recreate the data. It was also discovered that the European server had about 36 gigabytes of image cache data. We were able to transfer these cached images back to the North American database server and re-populate the image table, losing only a minimal amount of user photos.
We will be recovering more data as time goes on. We’ve got the support of some of the best minds in forensic data recovery and MySQL administration, including James Day of Wikipedia. As we recover more data, we will either merge it back into the website, or we will hold on to it in case it is needed at some time in the future. We will make every effort to recover every possible piece of data that existed prior to the crash.
How are we ensuring we never lose data again?
First of all, we have learned so much from this disaster. We will never again assume that the administrators have it under control; we will always check the work. We have retained the professional ‘Gold Support Service’ of MySQL. We have created both an on-site and off-site, rock solid back-up plan that is cascading and redundant. We will always strive to make the most secure CouchSurfing possible.
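One concrete form that "checking the work" could take is a freshness check: the crash revealed that the newest backup was about a week old when one about 24 hours old was expected. The sketch below is an illustration under stated assumptions, not a description of the plan mentioned above: the function name `backup_is_stale`, the 24-hour threshold default, and the example timestamps are all hypothetical, and it assumes the completion times of existing backups are already known.

```python
from datetime import datetime, timedelta


def backup_is_stale(backup_times, now, max_age=timedelta(hours=24)):
    """Return True if no backup finished within max_age before `now`.

    backup_times: iterable of datetime objects, one per completed backup
    now:          the moment to check against
    max_age:      how old the newest backup is allowed to be
    """
    backup_times = list(backup_times)
    if not backup_times:
        # No backups at all is the worst kind of stale.
        return True
    return now - max(backup_times) > max_age


# Hypothetical timestamps: the newest backup is a week old.
now = datetime(2020, 1, 8, 22, 0)
backups = [datetime(2019, 12, 25, 22, 0), datetime(2020, 1, 1, 22, 0)]

if backup_is_stale(backups, now):
    print("WARNING: newest backup is older than 24 hours")
```

Run on a schedule and wired to an alert, a check like this turns an assumption ("the backups are a day old") into something that is verified continuously.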
I want to thank everyone who helped along the way and who continues to toil day and night to restore CouchSurfing. Thanks and big hugs to the community members who immediately started relief efforts for the refugee CSers on the road. These other projects, although sometimes controversial, sought to keep the CouchSurfing network alive and help people recover their profiles. I am eternally grateful for your hours of work, determination, and never-ending spirit. I’m sorry I couldn’t be more responsive to you in the chaotic days following the storm. I’d especially like to extend my deepest gratitude to Mike Giddens and Marcus Eder of the CouchSurfing-Phoenix Project. You two demonstrated heroic commitment and dedication, and you deserve a huge pat on the back from not only me, but also the community you have done so much service for.