Blogger had an outage from Wednesday evening through Friday evening. Having sorted briefly through their very terse comments, and with some expertise in the industry, I think the story goes something like this…
- Blogger prepared some major new features to offer to the public as part of their blogging service. They apply software updates to their platform quarterly (or so), and this one had some bigger changes than usual.
- Wednesday night they took Blogger into read-only mode (no new posts) and applied the software update.
- Within hours, complaints of serious malfunctions began to come in. This sometimes happens with complicated software updates, which can be very hard to test under actual conditions, such as Blogger's hundreds of millions of users a day.
- So now the Blogger team had a real challenge. Blogs continued to post updates (and comments and changes) in the “new” system, so if they restored the “old” system those would be lost. Yet the new system wasn’t working right.
- The Blogger team put Blogger in “read-only” mode (no changes allowed to any blog) and tried to fix the problems in the “new” system. They were unsuccessful, leaving them with no choice but to return to the “old” system.
- This they did, but it was the “old” system as of Wednesday, and it was Friday morning already. So a day of blog activity just disappeared. Now they had the problem of moving all the new activity from the new system to the old system, which meant a lot of programming done under pressure.
- They completed this, brought all the new activity to the old system and finally turned Blogger back on.
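For the technically curious, the last two steps amount to "roll back, then replay". Here is a minimal sketch in Python of what that looks like in principle. It is only my illustration of the idea, not Blogger's actual code: the `Change` record, the `BlogStore` class and the `roll_back_and_replay` function are all made-up names, and I am assuming the new system kept a timestamped log of every change made after the Wednesday snapshot.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    """One piece of blog activity (hypothetical record format)."""
    timestamp: int  # when the change was made
    blog: str       # which blog it belongs to
    post_id: str    # which post it touches
    body: str       # the new content of that post


@dataclass
class BlogStore:
    """A toy stand-in for a blogging platform's data store."""
    posts: dict = field(default_factory=dict)  # (blog, post_id) -> body
    read_only: bool = False

    def apply(self, change: Change) -> None:
        # Read-only mode rejects every change to every blog.
        if self.read_only:
            raise PermissionError("store is in read-only mode")
        self.posts[(change.blog, change.post_id)] = change.body


def roll_back_and_replay(snapshot, new_system_log, snapshot_time):
    """Restore the old snapshot, then re-apply activity recorded after it."""
    # Work on a copy so the snapshot itself is left untouched.
    restored = BlogStore(posts=dict(snapshot.posts))
    # Replay in time order so later edits win over earlier ones.
    for change in sorted(new_system_log, key=lambda c: c.timestamp):
        if change.timestamp > snapshot_time:
            restored.apply(change)
    return restored


# The "old" system as of Wednesday (time 1), plus activity from the "new" one.
old = BlogStore(posts={("cooking", "p1"): "Wednesday recipe"})
log = [
    Change(5, "cooking", "p2", "Thursday recipe"),
    Change(3, "cooking", "p1", "Wednesday recipe, edited"),
]
restored = roll_back_and_replay(old, log, snapshot_time=1)
```

In the toy version this is a dozen lines. In real life the two systems probably store things differently, so every replayed change also has to be translated from the new format back to the old one, which is where the programming under pressure comes in.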
For most bloggers and Blogger-based blogs this meant no new posts for 2 days, along with some posts disappearing for 2 days. A few blogs disappeared entirely for 2 days.
All of this stinks. But software engineering is not perfect and such things do happen. Even with some expertise in the area, I’m not sure there’s anything we (the users) can do to prepare for such failures.