{"id":18,"date":"2005-01-15T17:48:03","date_gmt":"2005-01-15T22:48:03","guid":{"rendered":"http:\/\/lee.org\/blog\/archives\/2005\/01\/15\/why-i-run-my-own-blog-livejournal-outage\/"},"modified":"2005-01-25T16:31:46","modified_gmt":"2005-01-25T21:31:46","slug":"why-i-run-my-own-blog-livejournal-outage","status":"publish","type":"post","link":"https:\/\/www.lee.org\/blog\/2005\/01\/15\/why-i-run-my-own-blog-livejournal-outage\/","title":{"rendered":"Why I run my own blog: Livejournal outage"},"content":{"rendered":"<p>This is why: <a href=\"http:\/\/livejournal.com\/\">http:\/\/livejournal.com\/<\/a><\/p>\n<blockquote>\n<h1>Temporarily Unavailable<\/h1>\n<p>LiveJournal is currently down due to a <a href='\/powerloss\/'>massive power failure<\/a> at <a href='http:\/\/www.internap.com\/'>our data center<\/a>.  We&#8217;ll provide updates at <a href='\/powerloss\/'>\/powerloss\/<\/a> as they&#8217;re available.<\/p>\n<\/blockquote>\n<p><!--more--><br \/>\nHere is the message that was at <a href=\"http:\/\/livejournal.com\/powerloss\">http:\/\/livejournal.com\/powerloss<\/a> as of 1-15-05 5:50pm<\/p>\n<blockquote><p>Our data center (Internap, the same one we&#8217;ve been at for many years) lost all its power, including redundant backup power, for some unknown reason. (unknown to us, at least) We&#8217;re currently dealing with verifying the correct operation of our 100+ servers. Not fun. We&#8217;re not happy about this. Sorry&#8230; :-\/ More details later.<br \/>\nUpdate #1, 7:35 pm PST: we have power again, and we&#8217;re working to assess the state of the databases. The worst thing we could do right now is rush the site up in an unreliable state. We&#8217;re checking all the hardware and data, making sure everything&#8217;s consistent. Where it&#8217;s not, we&#8217;ll be restoring from recent backups and replaying all the changes since that time, to get to the current point in time, but in good shape. We&#8217;ll be providing more technical details later, for those curious, on the power failure (when we learn more), the database details, and the recovery process. For now, please be patient. We&#8217;ll be working all weekend on this if we have to.<\/p>\n<p>Update #2, 10:11 pm: So far so good. Things are checking out, but we&#8217;re being paranoid. A few annoying issues, but nothing that&#8217;s not fixable. We&#8217;re going to be buying a bunch of rack-mount UPS units on Monday so this doesn&#8217;t happen again. In the past we&#8217;ve always trusted Internap&#8217;s insanely redundant power and UPS systems, but now that this has happened to us twice, we realize the first time wasn&#8217;t a total freak coincidence. C&#8217;est la vie.<\/p>\n<p>Update #3: 2:42 am: We&#8217;re starting to get tired, but all the hard stuff is done at least. Unfortunately a couple machines had lying hardware that didn&#8217;t commit to disk when asked, so InnoDB&#8217;s durability wasn&#8217;t so durable (though no fault of InnoDB). We restored those machines from a recent backup and are replaying the binlogs (database changes) from the point of backup to present. That will take a couple hours to run. We&#8217;ll also be replacing that hardware very shortly, or at least seeing if we can find\/fix the reason it misbehaved. The four of us have been at this almost 12 hours, so we&#8217;re going to take a bit of a break while the binlogs replay&#8230; Again, our apologies for the downtime. This has definitely been an experience.<\/p>\n<p>Update #4: 9:12 am: We&#8217;re back at it. 
We&#8217;ll have the site up soon in some sort of crippled state while the clusters with the oldest backups continue to catch up.<\/p>\n<p>Update #5: 1:58 pm: approaching 24 hours of downtime&#8230; *sigh* We&#8217;re still at it. We&#8217;ll be doing a full write-up when we&#8217;re done, including what we&#8217;ll be changing to make sure verify\/restore operations don&#8217;t take so long if this is ever necessary again. The good news is the databases already migrated to InnoDB did fine. The bad news (obviously) is that our verify\/restore plan isn&#8217;t fast enough. And also that some of our machines&#8217; storage subsystems lie. Anyway, we&#8217;re still at it&#8230; it&#8217;s long because we&#8217;re making sure to back up even the partially out-of-sync databases that we&#8217;re restoring, so that if we encounter any problems down the road with the restored copy, we&#8217;ll be able to merge them. And unfortunately backups and networks are too slow.\n<\/p><\/blockquote>\n","protected":false},"excerpt":{"rendered":"<p>This is why: http:\/\/livejournal.com\/ Temporarily Unavailable LiveJournal is currently down due to a massive power failure at our data center. We&#8217;ll provide updates at \/powerloss\/ as they&#8217;re available.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2,1,6],"tags":[],"class_list":["post-18","post","type-post","status-publish","format-standard","hentry","category-geekery","category-general","category-wordpress"],"_links":{"self":[{"href":"https:\/\/www.lee.org\/blog\/wp-json\/wp\/v2\/posts\/18","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.lee.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.lee.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.lee.org\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.lee.org\/blog\/wp-json\/wp\/v2\/comments?post=18"}],"version-history":[{"count":0,"href":"https:\/\/www.lee.org\/blog\/wp-json\/wp\/v2\/posts\/18\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.lee.org\/blog\/wp-json\/wp\/v2\/media?parent=18"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.lee.org\/blog\/wp-json\/wp\/v2\/categories?post=18"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.lee.org\/blog\/wp-json\/wp\/v2\/tags?post=18"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}