Category: Planning

Guardian.co.uk: Steps to a smooth launch

At the weekend the Guardian website went through one of the most significant transformations in its history: we moved our news, politics and Observer content into the new design and new content management system, and we simultaneously launched a lot of new functionality, both internal and external.

There’s an introduction and discussion on the more public-facing aspects of this, kicked off by Emily Bell. For my part, I want to talk briefly about one of the most remarkable behind-the-scenes aspects of it: how we got the weekend launch to go so incredibly smoothly.

The secret is that the weekend’s work was only the final step after a great many before it, all of which were safely out of the way before the weekend…

1. Software release

The actual software was released some weeks ago, in early January. This means that by the time of the launch it had been in use for some time, almost all the lines of code having been executed several hundred (and in some cases several thousand) times already, in the production environment.

Even that January release was only an enhancement of previous releases which have been going out fairly quietly over the previous few months. The latest one included new internal tools, and updates to tools, to support some of the new features that are visible today.

2. Building the pages

Meanwhile editors, subeditors and production staff have had to learn to use those tools. They’ve been using them to migrate a lot of the older content into the new-format pages. You might think that could be done by machine, but that’s not the case. Since we’re also changing our information architecture — adding a lot of semantic structure where previously there was only presentational information — it takes an informed human being to know how to convert an older page into its newer format. A real person is needed to understand what a piece of content is really about, and what it isn’t about but merely mentions in passing. We also need people to look at the new keyword pages (for example, the one on immigration and asylum, or the page on Kosovo), which are populated mostly automatically, and for them to highlight the most important content: the most important content won’t necessarily be the newest.

This work had been going on for many weeks before the weekend launch. The January software release brought in some tool refinements to help them tidy up final loose ends (and no doubt some more tidying will happen over the next couple of weeks).

You can see from this that it’s about much more than mere software. The software enables people to do the editorial work, and the editorial work has been going on for some considerable time. Everything they’ve done has also been tested and previewed, which allows them to see what our external users would see were it to go live. Again, this exercises the software fully, but only within the internal network, before it’s exposed to the outside world.

3. Rehearsals

The work for the weekend launch is mainly running a lot of database scripts to make various new URLs live and decommission old ones. The reason this is such a huge launch is that there’s over ten years’ worth of news content to expose, as well as new functionality and designs.

We couldn’t trust the scripts to work first time (of course), so we spent a lot of time copying the production content into a safe environment and rehearsing the process there, with real data. We needed to be sure not just that it would work, but also how long it would take (considering any hardware differences), and to change the scripts and process accordingly.

4. Launch

Finally, after all the rehearsals, the real deal. The work to run the database scripts and raise the curtain on various features ran over Saturday and Sunday, but it was calm enough and organised enough that the team needed to work only slightly extended working days.

So the big launch was a culmination of a huge amount of effort, and the weekend work was after an awful lot of practice. There were a couple of sticky moments, but nothing the team couldn’t recover from in a few minutes. As one of the developers remarked towards the end of Sunday: “The fact that today’s been really tedious is a good thing.”

What we can see now includes

What’s next…

We’ll be ironing out a few things over the next few days, but everything’s gone to plan for now. And then, as Emily says, there’s still sport, arts, life and style, and education to do.

The Times Online’s redesign, and a word about performance testing

Late last night (UK time) the Times Online launched their new design, and jolly nice it is, too. It’s clean and spacious, and there’s an interview with the designers who are very excited about the introduction of lime green into the logo. Personally, I like the columnists’ pictures beside their pull-quotes. That’s a nice touch. You can also read about it from the Editor in Chief, Anne Spackman.

However, not everyone at the website will have been downing champagne at the moment the new site went live, because in the first few hours it was clearly having some performance issues. We’ve all had those problems some time in our careers (it’s what employers call “experience”), and it’s never pleasant. As I write it looks as though the Times Online might be getting back to normal, and no doubt by the time you read this the problems will all be ancient history. So while we give their techies some breathing space to get on with their jobs, here are three reasons why making performant software is not as easy as non-techies might think…

1. Software design

A lot of scalability solutions just have to be built in from the start. But you can’t do that unless you know what the bottlenecks are going to be. And you won’t know what the bottlenecks are going to be until you’ve built the software and tested it. So the best you can do from the start is make some good judgements and decide how you’d like to scale up the system if needed.

Broadly speaking you can talk of “horizontal scaling” and “vertical scaling”. The latter is when you scale by keeping the same boxes but beefing them up — giving them more memory/CPU/etc. The former is where you scale by keeping the same boxes end-to-end, but adding more alongside them. Applications are usually designed for one or the other (deliberately or not) and it’s good to know which before you start.

Vertical scaling seems like an obvious solution generally, but if you’re at the limit of what your hardware will allow then you’re at the limit of your scalability. Meanwhile a lot has been made of Google’s MapReduce algorithm which explicitly allowed parallelisation for the places it was applied — it allowed horizontal scaling, adding more machines. That’s very smart, but they’ll have needed to apply that up-front — retrofitting it would be very difficult.
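
To make the horizontal idea concrete, here is a minimal map/reduce-shaped sketch in Java (purely illustrative, and certainly not Google’s MapReduce itself): a word count where capacity is added by raising the number of workers rather than by buying a bigger box. It only parallelises because each document can be counted independently, which is exactly the kind of property you have to design in from the start.

```java
import java.util.*;
import java.util.concurrent.*;

// Illustrative only: a map/reduce-shaped word count where capacity is added
// by increasing the number of workers (horizontal) rather than by making
// any single worker bigger (vertical).
public class ParallelWordCount {

    static Map<String, Integer> countWords(String doc) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String word : doc.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) continue;
            Integer n = counts.get(word);
            counts.put(word, n == null ? 1 : n + 1);
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        List<String> documents = Arrays.asList(
                "the quick brown fox", "the lazy dog", "the fox and the dog");

        int workers = 3; // "add more boxes" by raising this number
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        // Map phase: each document is counted independently, so the work
        // partitions cleanly across however many workers we have.
        List<Future<Map<String, Integer>>> partials =
                new ArrayList<Future<Map<String, Integer>>>();
        for (final String doc : documents) {
            partials.add(pool.submit(new Callable<Map<String, Integer>>() {
                public Map<String, Integer> call() {
                    return countWords(doc);
                }
            }));
        }

        // Reduce phase: merge the partial counts into one result.
        Map<String, Integer> totals = new HashMap<String, Integer>();
        for (Future<Map<String, Integer>> f : partials) {
            for (Map.Entry<String, Integer> e : f.get().entrySet()) {
                Integer n = totals.get(e.getKey());
                totals.put(e.getKey(), n == null ? e.getValue() : n + e.getValue());
            }
        }
        pool.shutdown();
        System.out.println(totals);
    }
}
```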

You can also talk about scaling on a more logical level. For example, sometimes an application would do well to split into two distinct parts (keeping its data store separate from its logic, say), but if that was never considered when the application was designed then it will be too late once the thing has been built — there will be too many inter-dependencies to untangle.

That can even happen on a smaller scale. It’s a cliche that every programming problem can be solved with one more level of indirection, but you can’t build in arbitrary levels of indirection at every available point “just in case”. At Guardian Unlimited we make good use of the Spring framework and its system of Inversion of Control. It gives us more flexibility over our application layering, and one of our developers recently demonstrated to me a very elegant solution to one particular scaling problem using minimally-invasive code precisely because we’d adopted that IoC strategy — effectively another level of indirection. Unfortunately we can’t expect such opportunities every time.
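
To illustrate the sort of indirection I mean (a hypothetical sketch, not our actual code), imagine the page code depends on an ArticleStore interface rather than on a concrete class. A scaling fix such as a cache can then be slotted in behind the interface without touching the callers, and with an IoC container like Spring choosing the wrapper over the plain store becomes a configuration change rather than a code change:

```java
// Hypothetical sketch of indirection via an interface: the page code depends
// on ArticleStore, not on a concrete class, so a scaling fix can be slotted
// in behind the interface without touching the callers.
public interface ArticleStore {
    String fetchArticle(long id);
}

// The original, direct implementation (details elided).
class DatabaseArticleStore implements ArticleStore {
    public String fetchArticle(long id) {
        return "article " + id; // stand-in for a real database call
    }
}

// A later, minimally-invasive scaling fix: wrap the original in a cache.
// With an IoC container such as Spring, choosing this wrapper over the
// plain store is a wiring change rather than a change to the calling code.
class CachingArticleStore implements ArticleStore {
    private final ArticleStore delegate;
    private final java.util.Map<Long, String> cache =
            new java.util.concurrent.ConcurrentHashMap<Long, String>();

    CachingArticleStore(ArticleStore delegate) {
        this.delegate = delegate;
    }

    public String fetchArticle(long id) {
        String article = cache.get(id);
        if (article == null) {
            article = delegate.fetchArticle(id);
            cache.put(id, article);
        }
        return article;
    }
}
```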

2. Devising the tests

Before performance testing, you’ve got to know what you’re actually testing. Saying “the performance of the site” is too broad. There’s likely to be a world of difference between the following (the third case is sketched in code after the list):

  • Testing the code that pulls an article out of the database;
  • Testing the same code for 10,000 different articles in two minutes;
  • Testing 10,000 requests to the application server;
  • Testing 10,000 requests to the application server via the web server;
  • Testing the delivery of a page which includes many inter-linked components.
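
For instance, the third case might be approximated with nothing more elaborate than a fixed pool of threads firing requests at an application server and recording how long they take. This is a deliberately crude sketch (the URL and the numbers are invented), and a real load-testing tool does far more, but the shape is the same:

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

// A crude load-test sketch: fire many requests at an application server
// from a fixed pool of threads and report the average response time.
// The URL and numbers are illustrative, not a real endpoint.
public class SimpleLoadTest {
    public static void main(String[] args) throws Exception {
        final String target = "http://appserver.example.internal/article/12345";
        final int requests = 10000;
        final int concurrency = 50;

        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        final AtomicLong totalMillis = new AtomicLong();
        final CountDownLatch done = new CountDownLatch(requests);

        for (int i = 0; i < requests; i++) {
            pool.submit(new Runnable() {
                public void run() {
                    long start = System.currentTimeMillis();
                    try {
                        HttpURLConnection conn =
                                (HttpURLConnection) new URL(target).openConnection();
                        conn.getResponseCode();   // wait for the response
                        conn.disconnect();
                    } catch (Exception e) {
                        // A real test would record failures separately.
                    } finally {
                        totalMillis.addAndGet(System.currentTimeMillis() - start);
                        done.countDown();
                    }
                }
            });
        }
        done.await();
        pool.shutdown();
        System.out.println("Average response time: "
                + (totalMillis.get() / requests) + "ms over " + requests + " requests");
    }
}
```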

Even testing one thing is not enough. It’s no good testing the front page of the website and then testing an article page, because in reality requests come simultaneously to the front page and many article pages. It’s all very well testing whether they can work smoothly alone — it’s whether they work together in practice that counts. This is integration testing. And in practice many, many combinations of things happen together in an integrated system. You’ve got to make a call on what will give the best indicators in the time available.

Let me give two examples of integration testing from Guardian Unlimited. Kevin Gould’s article on eating in Barcelona is very easy to extract from the database — ask for the ID and get the content. But have a look down the side of the page and you’ll see a little slot that shows the tags associated with the article. In this case it’s about budget travel, Barcelona, and so on. That information is relatively expensive to generate. It involves cross referencing data about the article with data about our tags. So testing the article is fine, but only if we test it with the tags (and all the other things on the page) will we get half an idea about the performance in practice.

A second example takes us further in this direction. Sometimes you’ve got to test different operations together. When we were testing one version of the page generation sub-system internally we discovered that it slowed down considerably when journalists tried to launch their content. There was an interaction between reading from the database, updating the cache, and changing the content within the database. This problem was caught and resolved before anything went live, but we wouldn’t have found that if we hadn’t spent time dry-running the system with real people doing real work, and allowing time for corrections.
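
The general lesson is that a realistic test mixes the operations rather than repeating one of them in isolation. Here is a toy illustration of that shape (the PageService below is a made-up stand-in, nothing like our real page generation sub-system): reads that look instant on their own slow down markedly once publishing happens alongside them.

```java
import java.util.*;

// Toy illustration of testing mixed operations together. PageService is a
// made-up stand-in: publishing holds the same lock as reading, so reads
// that looked fast in isolation slow down once publishing runs alongside.
public class MixedWorkloadTest {

    static class PageService {
        private final Map<String, String> pages = new HashMap<String, String>();

        synchronized String read(String id) {
            return pages.get(id);
        }

        synchronized void publish(String id, String body) throws InterruptedException {
            Thread.sleep(50);          // pretend publishing is expensive
            pages.put(id, body);
        }
    }

    public static void main(String[] args) throws Exception {
        final PageService service = new PageService();
        service.publish("home", "front page");

        // A background "journalist" repeatedly publishing content.
        Thread writer = new Thread(new Runnable() {
            public void run() {
                try {
                    for (int i = 0; i < 20; i++) {
                        service.publish("story-" + i, "body");
                    }
                } catch (InterruptedException ignored) {}
            }
        });

        // Measure reads on their own, then with the writer running.
        System.out.println("reads alone:       " + timeReads(service) + "ms");
        writer.start();
        System.out.println("reads with writes: " + timeReads(service) + "ms");
        writer.join();
    }

    static long timeReads(PageService service) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < 200; i++) {
            service.read("home");
        }
        return System.currentTimeMillis() - start;
    }
}
```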

3. Scaling down the production environment

Once you’ve devised the tests, you’ll want to run them. Since performance testing is all about having adequate resources (CPU, memory, bandwidth, etc), you really should run your tests in an exact replica of the production environment, because that’s the only environment which can show you how those resources work together. However, this is obviously going to be very expensive, and for all but the most cash-rich of organisations prohibitively so.

So you’ll want to make a scaled-down version of the production environment. But that has its problems. Suppose your production environment has four web servers with two CPUs each and six application servers with 2GB of RAM each. What’s the best way to cut that down? Cutting it down by half might be okay, but if that’s still too costly then cutting it further is tricky. One and a half application servers? Two application servers with different amounts of RAM or CPU?

None of these options will be a real “system in miniature”, so you’re going to have to compromise somewhere. It’s a very difficult game to play, and a lot of the time you’ve got to play to probabilities and judgement calls.

And that’s only part of it

So next time you fume at one of your favourite websites going slow, by all means delve deep into your dictionary of expletives, but do also bear in mind that producing a performant system is not as easy as you might think. Not only does it encompass all the judgement calls and hard thinking above (and usually judgement calls under pressure), but it also includes a whole lot of really low-level knowledge both from software developers and systems administrators. And then, finally, be thankful you’re not the one who has to fix it. To those people we are truly grateful.

Measuring development is useful, up to a point

There’s a post from Joel O’Software regarding measuring performance and productivity. He’s saying some good stuff about how these metrics don’t work, but I’d like to balance it with a few further words in favour of metrics generally. Individual productivity metrics don’t work, but some metrics are still useful, including team metrics which you might class as productivity-related.

  • Individual productivity metrics don’t work.
  • Productivity-like metrics are still useful…
  • …but they don’t tell the whole story

Individual productivity metrics don’t work

Joel O’S states that if you incentivise people by any system, there are always opportunities to game that system. My own experience here is in a previous company where we incentivised developers by how much client-billable time they clocked up. Unfortunately it meant that the developers flatly refused to do any work on our internal systems. We developed internal account codes to deal with that, but it just meant that our incentivisation scheme was broken as a result. Joel has other examples, and Martin Fowler discusses the topic similarly.

Productivity-like metrics are still useful…

Agile development people measure something called “velocity”. It measures the amount of work delivered in an iteration, and as such might be called a measurement of productivity. But there are a couple of crucial differences from measuring things such as lines of code, or function points:

  • Velocity is a measurement of the team, not an individual.
  • It’s used for future capacity planning, not rewarding past progress.

Velocity can also be combined with future deadlines to produce burndown charts, allowing you to make tactical adjustments. Furthermore, a dip in any of these numbers can highlight that something’s going a bit wrong and some action needs to be taken. But that tells you something about the process, not the individuals.
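
For the avoidance of doubt, the arithmetic behind that kind of capacity planning is trivial; the value lies in how the team uses the result. A made-up illustration:

```java
// A made-up illustration of velocity used for capacity planning: average the
// work delivered by the team in recent iterations and project how many more
// iterations the remaining backlog will take. All the numbers are invented.
public class VelocityForecast {
    public static void main(String[] args) {
        int[] pointsDeliveredPerIteration = {21, 18, 24, 19};  // team figures, not individuals'
        int backlogRemaining = 130;                            // points still to deliver

        int total = 0;
        for (int points : pointsDeliveredPerIteration) {
            total += points;
        }
        double velocity = (double) total / pointsDeliveredPerIteration.length;

        // Round up: a part-finished iteration still has to be scheduled.
        int iterationsNeeded = (int) Math.ceil(backlogRemaining / velocity);

        System.out.println("Average velocity: " + velocity + " points per iteration");
        System.out.println("Forecast: about " + iterationsNeeded + " more iterations");
    }
}
```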

The kick-off point for Joel’s most recent essay on the subject is a buzzword-ridden (and just clunkily-worded) cold-call e-mail from a consultant:

Our team is conducting a benchmarking effort to gather an outside-in view on development performance metrics and best practice approaches to issues of process and organization from companies involved in a variety of software development (and systems integration).

It’s a trifle unfair to criticise the consultant for looking at performance metrics, but one has to be careful about what they’re going to be used for.

…but they don’t tell the whole story

A confession. We track one particular metric here in the development team at Guardian Unlimited. And a few days ago we recorded our worst ever figure for this metric since we started measuring it. You could say we had our least productive month ever. You could. But were my management peers in GU unhappy? Was there retribution? No, there was not. In fact there was much popping of champagne corks here, because we all understand that progress isn’t measured by single numbers alone. The celebrations were due to the fact that, with great effort from writers, subs, commercial staff, sponsors, strategists and other technologists we had just launched

A bad month then? Not by a long shot. The numbers do tell us something. They tell me there was a lot of unexpected last-minute running around, and I’ve no doubt we can do things better the next time. It’s something I’ve got to address. But let’s not flog ourselves too much over it — success is about more than just numbers.

Big bangs and Telegraph.co.uk’s redesign

Technical people don’t tend to like a big bang release, but it’s often perceived as the only viable option when you launch a new service. Why is this, and is there any kind of middle ground? The recent redesign of Telegraph.co.uk is a useful way to look at this. I’m also going to use examples from Guardian Unlimited, most notably Comment is Free.

What is a big bang release?

A big bang release is one in which a large system is released all in one go. Recent examples from Guardian Unlimited include the Comment is Free site and our daily podcasts. This week the Telegraph launched its website redesign. In the offline world, UK newspapers have been in a state of almost constant change recently, kicked off largely by the Independent’s sudden transformation from a broadsheet to a tabloid in September 2003. The Times followed suit shortly after, and in September 2005 The Guardian changed to the “Berliner” size, new to any British paper. These launches can only be done in one go: you can’t have a newspaper that’s broadsheet on page 1 and tabloid on page 2.

Contrast that to…

At the opposite end of the spectrum from the big bang is the mantra “release early, release often”. This is the approach to which the Linux community attributes much of its success. Much of GU’s work is done this way: while there’s always a project in the pipeline like Comment is Free, most of the time we’re making small improvements and adding new features day by day. Some of these are internal tools for our colleagues, some are bugfixes, but many are small features added to the public site. In January the Weekend magazine ran a photo competition with Windows XP. We worked with them to produce an online entry form that allowed our readers to upload their photographic entries directly to us.

So what are the worries about big bangs?

Big bangs give you a jump on your competitors and give you a big marketing boost. But they also put an awful lot at stake. Releasing a photo uploader is one thing. That’s more of a feature release than a product release. Consider what makes up Comment is Free, which the GU team created in collaboration with Ben Hammersley. In no particular order:

  • Steve Bell’s If… cartoon every day;
  • Comments on blog entries;
  • Keywords applied to each entry;
  • Commenters need to sign in (or register) before they can comment;
  • An RSS feed for the whole blog, plus one for each contributor;
  • A one-click feature for editors behind the scenes to “push” an article from the Guardian or the Observer into Comment is Free;
  • Dan Chung’s photo blog;
  • A list of the most active entries on the blog…
  • …and the most interesting from elsewhere;
  • A search facility;
  • The ability to serve all this even with the inevitable battering of thousands of readers checking it out every day.

And that’s not counting the editorial team powering it and the work of the designers creating something that is beautiful and works across a good range of web browsers (including the famously fickle Internet Explorer).

Using the example of Comment is Free, there are several kinds of difficulties with big bangs. One is that there is a lot that could potentially go wrong. Another is that there may be a big switch-over which is difficult to engineer and risky to run. A third relates to community reaction, and a fourth relates to social psychology.

Difficulty 1: The lots-to-go-wrong scenario

Within hours of launch, any of the Comment is Free features above (and any number of others which I forgot) could have failed in unpredictable but surprisingly common circumstances. The important thing to remember is that this is the first time any of these features will have been tested by unknown people, and the first time they’ll be used together in unpredictable ways. A problem with a core component of the system can have an impact on many aspects visible to the readers. A big bang can’t be rolled back without a lot of egg on our faces.

The problem is much more acute with anything that makes a fundamental change to a core part of a system. In any one hour we’ll be serving hundreds of thousands of pages, some of which were published five minutes ago, some of which were published five years ago. A change to the way we present articles is going to have consequences for thousands of people in myriad different situations over the course of a few minutes. If we got something wrong there, misjudged some of those situations, we could annoy hundreds or thousands of them very quickly. And since such problems can take time to fix, more people will be annoyed the longer it takes.

All but the smallest of software launches throw up some problems. Often they are too minor to worry too much about (a particular colour change not working in a particular browser, for instance). Sometimes they are superseded by more significant problems in the same launch. With a small feature release (early and often) we can spend a little time and fix it. But if a sweeping change has occurred there’s the danger of having too many things to fix in a timely fashion. Even the original big bang took several millennia to settle down, although to be fair that was mostly a hardware rollout.

Difficulty 2: The big-switch-over scenario

With some systems it’s not a case of making a big thing public in one shot; sometimes it’s a major internal mechanism that has to be switched over in one go.

Adding comments to articles might not just mean an addition to the underlying system, but a wholesale either/or change. The system must work not just before and after the change, but during the change, too. And everything that links into the part which changes must be somehow unlinked and relinked in good time.

If you think this language (part, unlink, etc) is vague, by the way, there’s a reason for that: the things that need to be changed will be many and varied. Database tables, triggers, stored procedures, software libraries, application code, outgoing feeds, incoming feeds, application servers, web servers… all these might need “unlinking” (dropping, reinstalling, stopping, bouncing, whatever). It needs careful planning, co-ordination, and execution, and it might only be possible over several minutes, which is a very long time if you’re running a live service.

Again, rehearsing this and doing it for real in the live environment are very different things. But this time if something goes wrong you’re stuck with a broken service or something which will take just as long to roll back — which will only leave you back at square one.

Difficulty 3: Community reaction

Even if everything goes well on a technical level there’s still the danger that your user community won’t like it.

A failure for Comment is Free, Emily Bell has written, would have been if no readers had posted any comments at all. That would have been an awful lot of work wasted. But at least in that scenario we could have quietly shut it down and walked away.

A more difficult scenario would have been if people were using it, but found a fundamental feature annoying or obstructive. A good example of this is the page width on Comment is Free. 99% of Guardian Unlimited pages are designed to fit a monitor which is 800 pixels wide. That was pretty standard six years ago when the site was designed, but is less common now. Both Comment is Free and the Telegraph’s new design require a width of at least 1000 pixels. That gives much greater design freedom and hence a better user experience — unless you can’t or won’t stretch your browser that wide, in which case it could be very alienating. If readers react badly to the 1000 pixel design then there’s very little that can be done about it. Sure you can redesign, but that’s a major undertaking.

Difficulty 4: Psychology

So, let’s suppose the features all worked, the switch-over all worked, and the community doesn’t react negatively. There’s still the problem that people, well, they just might not gel with it.

Blogs are a very good example of this. They’ve been around for almost a decade, and talk/discussion software has been around for much longer. But blogs only really took off in the last two or three years. Much of this is down to getting the psychology right:

  • A blog is now recognised as a largely personal affair, being the soap box for individuals rather than a free-for-all forum;
  • discussion tends to be just a single column of comments rather than threaded, ensuring commenters respond more to the original poster, rather than getting into an internal discussion;
  • most bloggers write less content but write it more often.

That list is not something that falls straight out of a grand vision. It’s a series of very subtle psychological and social lessons that the whole online community learned over a very long period. Similarly any software release relies on social and psychological subtleties that can’t always be guessed at, and will often overlap each other. Making a single big release will obscure any of these subtleties. If people aren’t reacting to your product in quite the way you hoped then working out why could be difficult: a small tweak to a couple of presentational elements might make all the difference in the world… or perhaps your promotion of the product just wasn’t enough.

How can we mitigate this?

From the outside it looks like the Telegraph’s website redesign was executed — is being executed as I write — without breaking a sweat (though I daresay internally there were a lot of frayed nerves as well as excitement). It gives us a few clues about how to manage a big bang.

Mitigation: Splash instead of bang

One trick is to make a big splash, rather than a big bang. The website’s redesign is actually rolling out over days and weeks, not minutes or hours. The team made sure to change the high-profile front page and World Cup pages first, ensuring the perceived impact was greatest. They get a lot of credit without risking more than they need to.

Mitigation: Step-by-step approach

Even if plaudits are not what you’re after, it can be useful to slow a big bang down into small, distinct steps.

A while ago at GU we needed to switch from one kind of server to another. On the face of it there was no way to smooth this switch: requests either went to one server, or they went to the other. Nevertheless, any kind of problem switching over could have been disastrous. We took extra time to build in a middle tier, which brokered incoming requests to the back-end servers, and added into that a facility to route different classes of requests to one or other of the servers. Initially everything went to the one server. Then we altered it so that only one specific kind of request, a tiny fraction of the whole, was routed to the other. We ensured that was working smoothly before we sent another class of requests over, and so on. We took what would originally have been a dangerous all-or-nothing switch, and smoothed it out over several weeks.
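
The heart of that middle tier was no more exotic than a routing rule per class of request, each rule flipped from the old server to the new one only once the previous step had proved itself. This hypothetical sketch (the request classes and host names are invented) shows the general shape:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a middle-tier routing decision: each class of
// request is directed to either the old or the new back-end server, and
// classes are flipped over one at a time as confidence grows. The request
// classes and host names are invented for illustration.
public class RequestBroker {

    private static final String OLD_SERVER = "http://old-backend.internal";
    private static final String NEW_SERVER = "http://new-backend.internal";

    // Routing table: true means "send this class of request to the new server".
    // Initially everything is false; entries are flipped one step at a time.
    private final Map<String, Boolean> routeToNew = new LinkedHashMap<String, Boolean>();

    public RequestBroker() {
        routeToNew.put("/rss/", true);       // step 1: a tiny fraction of traffic
        routeToNew.put("/article/", false);  // step 2: flip once RSS is proven
        routeToNew.put("/front/", false);    // step 3: the front pages last
    }

    /** Choose a back-end for an incoming request path; default to the old server. */
    public String backendFor(String path) {
        for (Map.Entry<String, Boolean> rule : routeToNew.entrySet()) {
            if (path.startsWith(rule.getKey())) {
                return rule.getValue() ? NEW_SERVER : OLD_SERVER;
            }
        }
        return OLD_SERVER;
    }

    public static void main(String[] args) {
        RequestBroker broker = new RequestBroker();
        System.out.println(broker.backendFor("/rss/world"));      // goes to the new server
        System.out.println(broker.backendFor("/article/12345"));  // stays on the old one, for now
    }
}
```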

Mitigation: Avoid feature creep

Another thing the Telegraph team have done is to avoid feature creep. The redesign comes with almost no new features — I can count the ticker, but the rest seem to be rearrangements of the same material. This contrasts (unfairly, but instructively) with, say, the Guardian’s Berliner reformatting, where the editor took a conscious decision not just to change the size of the paper, but also the design, the section arrangements, the approach to presenting the writing (separating news and comment more distinctly, referencing the online world more) and numerous internal processes.

Avoiding feature creep on software projects is very, very difficult, and it takes strong wills and authority to say No to the many voices who want just a little bit more.

Mitigation: Don’t just test — rehearse

Testing individual features is one thing: testing the whole lot together is another. That’s why rehearsing the whole sequence together is important. And there’s no point rehearsing if you’ve not built in time to make changes when (not if) you discover there are flaws in your plan.

Mitigation: Plan for natural breaks

The Telegraph also seem to have a release plan with plenty of natural breaks: once started it seems they weren’t on a time-critical unstoppable slide. Of course they will have had deadlines, but if they had to slip some of these for the sake of something else then that was always a possibility.

A good example of this is their replacement of the missing e-mail and print features. The site’s articles originally had links to mail them to friends or print them out. The original rollout omitted these features, and there was a promise that they would be restored later. However the outcry from the community was such that they interrupted their schedule and reinserted them earlier than planned. It’s a good example of listening to your community, but it also shows that you need the flexibility to react accordingly.

Mitigation: The safe but indirect route

Big-switch-over scenarios can be avoided by ensuring you take a route that’s not necessarily direct but is at least safe. You may end up with a very long and winding plan, however. This is a bit like the word game where you have to go from one word to another, changing one letter at a time, but still ensuring that you’ve got a word at every step of the way.

W H I T E
W H I N E
S H I N E
S H I R E
S H I R K
S H A R K
S H A C K
S L A C K
B L A C K

This can still give you your big bang if the final step is the only visible step. The Telegraph redesign will have involved an awful lot more behind the scenes than the apparently straightforward change first seen by the public. No doubt they could have done it quicker at the expense of a more confusing user experience, but they managed to keep the end user change to a minimum.

Mitigation: Trial features elsewhere

If you’re unsure of a particular part of a big bang release then it might be possible to trial just that feature elsewhere. As noted above, Comment is Free was the first large-scale GU site with a thousand-pixel width. It was a calculated risk. Reaction to that will inevitably inform any future services we launch.

Mitigation: Loosely coupled systems

Another mitigation technique to allow a big bang is to ensure the product being released is on an entirely separate system to anything else. If you’ve been following Comment is Free very closely, you’ll have read Ben Hammersley’s comment (you’ll need to scroll down or search for his name) that it’s based on the Movable Type blogging software. MT is certainly not what we use as our main content management system, so it’s necessarily a separate system. This makes the big bang release easier: there is integration, but it’s loosely coupled. The systems are linked, but they’re not tied so closely that a disaster with one will adversely affect the other.
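
Loose coupling here means the two systems talk over a narrow, well-defined channel rather than sharing each other’s internals. As a purely hypothetical sketch (this is not how GU and Movable Type actually integrate, and the endpoint is invented), pushing an article summary to a separate blog system might be little more than an HTTP POST that fails fast and never takes the main CMS down with it:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Purely hypothetical sketch of loose coupling: the main CMS pushes an
// article summary to a separate blog system over a narrow HTTP interface.
// If the call fails, the CMS logs it and carries on; neither system needs
// the other's internals. The endpoint URL is invented for illustration.
public class BlogPusher {

    public boolean pushArticle(String title, String standfirst, String articleUrl) {
        try {
            URL endpoint = new URL("http://blog.example.internal/push");
            HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setConnectTimeout(2000);   // fail fast: don't let the blog slow the CMS
            conn.setReadTimeout(2000);
            conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");

            String body = "title=" + java.net.URLEncoder.encode(title, "UTF-8")
                    + "&standfirst=" + java.net.URLEncoder.encode(standfirst, "UTF-8")
                    + "&url=" + java.net.URLEncoder.encode(articleUrl, "UTF-8");
            OutputStream out = conn.getOutputStream();
            out.write(body.getBytes("UTF-8"));
            out.close();

            return conn.getResponseCode() == 200;
        } catch (Exception e) {
            // A failure here should never break article publishing itself.
            System.err.println("Blog push failed: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        boolean ok = new BlogPusher().pushArticle(
                "A hypothetical article", "Its standfirst", "http://www.example.com/article");
        System.out.println("Pushed: " + ok);
    }
}
```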

Don’t think it’s solved

All of the above approaches (and the many more I’ve inevitably missed) help, but they aren’t silver bullets. Each one has its drawbacks. A splash might not always be a commercially-acceptable compromise; there may be no obvious step-by-step solution; avoiding feature creep relies on more strength than organisations usually possess in the heat of the excitement, and even then it only limits the problems without eliminating them; rehearsing requires time and an expensive duplication of the entire production environment; natural breaks can be difficult to identify, and can rely on a running-order that isn’t compatible with commercial interests; the indirect route is necessarily lengthier and more costly; feature trials are only possible in moderation; loosely coupled systems limit the amount of integration and future development you can do.

The Telegraph has clearly encountered some of these problems.

It’s notable that “Education” disappeared from their navigation bars initially, and was only findable via the search engine, as Shane Richmond recommended to readers in one blog comment. No doubt they couldn’t find a cost-effective route from the old design to the new that preserved all the navigational elements.

The switch to a wider page was greeted with mixed reactions, and while we limited it to Comment is Free, the Telegraph dived right in and is applying it to everything.

Finally, it’s interesting that the new design is quite sympathetic to their previous one. No doubt a lot of that is due to the Telegraph having the same core values two years ago as it has today. But it’s some way from some of the more radical ideas they had, and no doubt that is partly a constraint of having to live with two designs for a period.

So where does this leave us?

I hope this explains some of the problems with big bangs, some of the ways to deal with them, and the difficulties that still remain.