Guardian.co.uk: Steps to a smooth launch

At the weekend the Guardian website went through one of the most significant transformations in its history: we moved our news, politics and Observer content into the new design and new content management system, and we simultaneously launched a lot of new functionality, both internal and external.

There’s an introduction and discussion on the more public-facing aspects of this, kicked off by Emily Bell. For my part, I want to talk briefly about one of the most remarkable behind-the-scenes aspects of it: how we got the weekend launch to go so incredibly smoothly.

The secret is that the weekend’s work was only the final step after a great many before it, all of which were safely out of the way before the weekend…

1. Software release

The actual software was released some weeks ago, in early January. This means that by the time of the launch it had been in use for some time, almost all the lines of code having been executed several hundred (and in some cases thousand) times already, in the production environment.

Even that January release was only an enhancement of previous releases which have been going out fairly quietly over the previous few months. The latest one included new internal tools, and updates to tools, to support some of the new features that are visible today.

2. Building the pages

Meanwhile editors, subeditors and production staff have had to learn to use those tools. They’ve been using them to migrate a lot of the older content into the new-format pages. You might think that could be done by machine, but that’s not the case. Since we’re also changing our information architecture — adding a lot of semantic structure where previously there was only presentational information — it takes an informed human being to know how to convert an older page into its newer format. A real person is needed to understand what a piece of content is really about, and what it isn’t about but merely mentions in passing. We also need people to look at the new keyword pages (for example, the one on immigration and asylum, or the page on Kosovo), which are populated mostly automatically, and for them to highlight the most important content: the most important content won’t necessarily be the newest.

This work had been going on for many weeks before the weekend launch. The January software release brought in some refinements to the tools to help staff tidy up final loose ends (and no doubt some more tidying will happen over the next couple of weeks).

You can see from this that it’s about much more than mere software. The software enables people to do the editorial work, and the editorial work has been going on for some considerable time. Everything they’ve done has also been tested and previewed, which allows them to see what our external users would see were it to go live. Again, this exercises the software fully, but only within the internal network, before it’s exposed to the outside world.

3. Rehearsals

The work for the weekend launch is mainly running a lot of database scripts to make various new URLs live and decommission old ones. The reason this is such a huge launch is that there’s over ten years’ worth of news content to expose, as well as new functionality and designs.

We couldn’t trust the scripts to work first time (of course), so we spent a lot of time copying the production content into a safe environment, and rehearsing the process there, with real data. We needed to be sure not just that it would work, but also how long it would take (allowing for any hardware differences between the rehearsal and production environments), and to change the scripts and process accordingly.
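To make the "how long will it take" question concrete, here is a minimal sketch of turning rehearsal timings into a production estimate. Everything in it is an assumption for illustration: the script names, the timings, the hardware factor and the safety margin are hypothetical, and this is not the process the team actually used.

```python
def projected_minutes(rehearsal_minutes, hardware_factor, safety_margin=0.25):
    """Estimate the production run time from rehearsal timings.

    rehearsal_minutes: mapping of script name -> minutes taken in rehearsal.
    hardware_factor:   expected production time relative to rehearsal time
                       (0.5 means production is assumed to be twice as fast).
    safety_margin:     extra headroom, as a fraction, for surprises on the day.
    """
    if hardware_factor <= 0:
        raise ValueError("hardware_factor must be positive")
    total = sum(rehearsal_minutes.values())
    return total * hardware_factor * (1 + safety_margin)


# Hypothetical rehearsal timings, not real figures.
rehearsal = {"publish_new_urls": 40.0, "retire_old_urls": 20.0}
estimate = projected_minutes(rehearsal, hardware_factor=0.5)
```

The hardware factor is the judgement call here: it can only be guessed at until the scripts have actually run on production-like machines, which is part of why repeated rehearsals matter.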

4. Launch

Finally, after all the rehearsals, the real deal. The work to run the database scripts and raise the curtain on various features ran over Saturday and Sunday, but it was calm enough and organised enough that the team needed to work only slightly extended working days.

So the big launch was a culmination of a huge amount of effort, and the weekend work was after an awful lot of practice. There were a couple of sticky moments, but nothing the team couldn’t recover from in a few minutes. As one of the developers remarked towards the end of Sunday: “The fact that today’s been really tedious is a good thing.”

What we can see now includes

What’s next…

We’ll be ironing out a few things over the next few days, but everything’s gone to plan for now. And then, as Emily says, there’s still sport, arts, life and style, and education to do.

The Times Online’s redesign, and a word about performance testing

Late last night (UK time) the Times Online launched their new design, and jolly nice it is, too. It’s clean and spacious, and there’s an interview with the designers who are very excited about the introduction of lime green into the logo. Personally, I like the columnists’ pictures beside their pull-quotes. That’s a nice touch. You can also read about it from the Editor in Chief, Anne Spackman.

However, not everyone at the website will have been downing champagne at the moment the new site went live, because in the first few hours it was clearly having some performance issues. We’ve all had those problems some time in our careers (it’s what employers call “experience”), and it’s never pleasant. As I write it looks as though the Times Online might be getting back to normal, and no doubt by the time you read this the problems will all be ancient history. So while we give their techies some breathing space to get on with their jobs, here are three reasons why making performant software is not as easy as non-techies might think…

1. Software design

A lot of scalability solutions just have to be built in from the start. But you can’t do that unless you know what the bottlenecks are going to be. And you won’t know what the bottlenecks are going to be until you’ve built the software and tested it. So the best you can do from the start is make some good judgements and decide how you’d like to scale up the system if needed.

Broadly speaking you can talk of “horizontal scaling” and “vertical scaling”. The latter is when you can scale by keeping the same boxes but beefing them up — giving them more memory/CPU/etc. The former is where you can scale by keeping the same boxes end-to-end, but add more alongside them. Applications are usually designed for one or the other (deliberately or not) and it’s good to know which before you start.

Vertical scaling seems like an obvious solution generally, but if you’re at the limit of what your hardware will allow then you’re at the limit of your scalability. Meanwhile a lot has been made of Google’s MapReduce algorithm which explicitly allowed parallelisation for the places it was applied — it allowed horizontal scaling, adding more machines. That’s very smart, but they’ll have needed to apply that up-front — retrofitting it would be very difficult.
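As a rough illustration of why that shape of problem scales horizontally, here is a minimal sketch in the MapReduce style, counting words. It is not Google's implementation: the function names are my own, and a local thread pool stands in for a fleet of machines. The point is only that each map step touches one chunk and nothing else, so more workers can be added without changing the logic.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(chunk):
    """Map step: count the words in one chunk, independently of all others."""
    return Counter(chunk.lower().split())


def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


def word_count(chunks, workers=4):
    """Run the map step across a pool of workers, then reduce the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge_counts, partials, Counter())


counts = word_count(["the cat sat", "the dog sat", "the end"])
```

Changing `workers` changes how much runs in parallel but not the answer, which is the property that has to be designed in from the start.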

You can also talk about scaling on a more logical level. For example, sometimes an application would do well to split into two distinct parts (keeping its data store separate from its logic, say) but if that was never considered when the application was built then it will be too late once the thing has been built: there will be too many inter-dependencies to untangle.

That can even happen on a smaller scale. It’s a cliche that every programming problem can be solved with one more level of indirection, but you can’t build in arbitrary levels of indirection at every available point “just in case”. At Guardian Unlimited we make good use of the Spring framework and its system of Inversion of Control. It gives us more flexibility over our application layering, and one of our developers recently demonstrated to me a very elegant solution to one particular scaling problem using minimally-invasive code precisely because we’d adopted that IoC strategy — effectively another level of indirection. Unfortunately we can’t expect such opportunities every time.
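To show the general idea, here is a minimal sketch of that kind of indirection. It is plain Python constructor injection rather than Spring, and it is not the solution our developer showed me; all the class names are hypothetical. The renderer only knows about an interface, so a caching layer can be slotted in without touching the renderer at all.

```python
from typing import Protocol


class ArticleStore(Protocol):
    """The interface the rest of the application depends on."""

    def fetch(self, article_id: int) -> str: ...


class DatabaseStore:
    """Hypothetical stand-in for a slow database lookup."""

    def __init__(self):
        self.calls = 0  # counts how often the "database" is hit

    def fetch(self, article_id):
        self.calls += 1
        return f"article {article_id}"


class CachingStore:
    """Wraps any ArticleStore and remembers what it has already fetched."""

    def __init__(self, inner):
        self._inner = inner
        self._cache = {}

    def fetch(self, article_id):
        if article_id not in self._cache:
            self._cache[article_id] = self._inner.fetch(article_id)
        return self._cache[article_id]


class PageRenderer:
    """Receives its store from outside instead of constructing one itself."""

    def __init__(self, store: ArticleStore):
        self._store = store

    def render(self, article_id):
        return f"<h1>{self._store.fetch(article_id)}</h1>"


database = DatabaseStore()
renderer = PageRenderer(CachingStore(database))  # wiring decided in one place
first = renderer.render(1)
second = renderer.render(1)  # served from the cache, not the database
```

The scaling change lives entirely in the wiring line. Had `PageRenderer` created its own `DatabaseStore` internally, adding the cache would have meant editing the renderer itself.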

2. Devising the tests

Before performance testing, you’ve got to know what you’re actually testing. Saying “the performance of the site” is too broad. There’s likely to be a world of difference between:

  • Testing the code that pulls an article out of the database;
  • Testing the same code for 10,000 different articles in two minutes;
  • Testing 10,000 requests to the application server;
  • Testing 10,000 requests to the application server via the web server;
  • Testing the delivery of a page which includes many inter-linked components.

Even testing one thing is not enough. It’s no good testing the front page of the website and then testing an article page, because in reality requests come simultaneously to the front page and many article pages. It’s all very well testing whether they can work smoothly alone — it’s whether they work together in practice that counts. This is integration testing. And in practice many, many combinations of things happen together in an integrated system. You’ve got to make a call on what will give the best indicators in the time available.
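Here is a minimal sketch of what "testing them together" might look like: a mixed stream of front-page and article requests sent concurrently, with a timing recorded for each. The handler is a hypothetical stand-in that just sleeps briefly; in a real test it would make an HTTP request to the system under test. The paths and the one-in-five ratio are assumptions too.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_handler(path):
    """Hypothetical stand-in for a real request; returns an HTTP-like status."""
    time.sleep(0.001)
    return 200


def build_mix(article_paths, total, front_every=5):
    """Interleave requests so that every `front_every`-th one hits the front page."""
    mix = []
    for i in range(total):
        if i % front_every == 0:
            mix.append("/")
        else:
            mix.append(article_paths[i % len(article_paths)])
    return mix


def run_mixed_load(handler, paths, concurrency=8):
    """Send all requests through a thread pool; return (path, status, seconds) tuples."""

    def timed(path):
        start = time.perf_counter()
        status = handler(path)
        return path, status, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed, paths))


mix = build_mix(["/travel/barcelona", "/politics/budget"], total=10)
results = run_mixed_load(fake_handler, mix)
slowest = max(seconds for _, _, seconds in results)
```

Choosing the mix is the judgement call: the ratio of front-page to article traffic, and which articles, should ideally come from real traffic patterns rather than a guess like the one above.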

Let me give two examples of integration testing from Guardian Unlimited. Kevin Gould’s article on eating in Barcelona is very easy to extract from the database — ask for the ID and get the content. But have a look down the side of the page and you’ll see a little slot that shows the tags associated with the article. In this case it’s about budget travel, Barcelona, and so on. That information is relatively expensive to generate. It involves cross referencing data about the article with data about our tags. So testing the article is fine, but only if we test it with the tags (and all the other things on the page) will we get half an idea about the performance in practice.

A second example takes us further in this direction. Sometimes you’ve got to test different operations together. When we were testing one version of the page generation sub-system internally we discovered that it slowed down considerably when journalists tried to launch their content. There was an interaction between reading from the database, updating the cache, and changing the content within the database. This problem was caught and resolved before anything went live, but we wouldn’t have found that if we hadn’t spent time dry-running the system with real people doing real work, and allowing time for corrections.

3. Scaling down the production environment

Once you’ve devised the tests, you’ll want to run them. Since performance testing is all about having adequate resources (CPU, memory, bandwidth, etc), you really should run your tests in an exact replica of the production environment, because that’s the only environment which can show you how those resources work together. However, this is obviously going to be very expensive, and for all but the most cash-rich of organisations prohibitively so.

So you’ll want to make a scaled down version of the production environment. But that has its problems. Suppose your production environment has four web servers with two CPUs each and six application servers with 2GB of RAM each. What’s the best way to cut that down? Cutting it down by a half might be okay, but if that’s still too costly then cutting it further is tricky. One and a half application servers? Two application servers with different amounts of RAM or CPU?

None of these options will be a real “system in miniature”, so you’re going to have to compromise somewhere. It’s a very difficult game to play, and a lot of the time you’ve got to play to probabilities and judgement calls.
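The arithmetic behind that awkwardness is simple enough to sketch. Using the example figures above (four web servers, six application servers), the function below scales each tier by a factor and reports which tiers no longer come out as whole machines. The function and key names are my own, purely for illustration.

```python
from fractions import Fraction

# The example production environment from the text.
production = {"web_servers": 4, "app_servers": 6}


def scale_down(environment, factor):
    """Scale every tier by `factor`; also list tiers that are not whole boxes."""
    scaled = {tier: Fraction(count) * factor for tier, count in environment.items()}
    awkward = [tier for tier, count in scaled.items() if count.denominator != 1]
    return scaled, awkward


half, half_awkward = scale_down(production, Fraction(1, 2))
quarter, quarter_awkward = scale_down(production, Fraction(1, 4))
```

Halving gives two web servers and three application servers, which is fine. Quartering gives one web server and one and a half application servers, and any way of rounding that changes the ratio between the tiers.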

And that’s only part of it

So next time you fume at one of your favourite websites going slow, by all means delve deep into your dictionary of expletives, but do also bear in mind that producing a performant system is not as easy as you might think. Not only does it encompass all the judgement calls and hard thinking above (and usually judgement calls under pressure), but it also includes a whole lot of really low-level knowledge both from software developers and systems administrators. And then, finally, be thankful you’re not the one who has to fix it. To those people we are truly grateful.

Measuring development is useful, up to a point

There’s a post from Joel O’Software regarding measuring performance and productivity. He’s saying some good stuff about how these metrics don’t work, but I’d like to balance it with a few further words in favour of metrics generally. Individual productivity metrics don’t work, but some metrics are still useful, including team metrics which you might class as productivity-related.

  • Individual productivity metrics don’t work.
  • Productivity-like metrics are still useful…
  • …but they don’t tell the whole story

Individual productivity metrics don’t work

Joel O’S states that if you incentivise people by any system, there are always opportunities to game that system. My own experience here is in a previous company where we incentivised developers by how much client-billable time they clocked up. Unfortunately it meant that the developers flatly refused to do any work on our internal systems. We developed internal account codes to deal with that, but it just meant that our incentivisation scheme was broken as a result. Joel has other examples, and Martin Fowler discusses the topic similarly.

Productivity-like metrics are still useful…

Agile development people measure something called “velocity”. It measures the amount of work delivered in an iteration, and as such might be called a measurement of productivity. But there are a couple of crucial differences to measuring things such as lines of code, or function points:

  • Velocity is a measurement of the team, not an individual.
  • It’s used for future capacity planning, not rewarding past progress.

Velocity can also be used in combination with looking at future deadlines to produce burndown charts and so allow you to make tactical adjustments accordingly. Furthermore, a dip in any of these numbers can highlight that something’s going a bit wrong and some action needs to be taken. But that tells you something about the process, not the individuals.
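As a minimal sketch of that planning use, here is velocity as a rolling average of points delivered per iteration, projected forward into a burndown. The point values, the three-iteration window and the function names are assumptions for illustration; teams calculate these in different ways.

```python
import math


def velocity(points_per_iteration, window=3):
    """Team velocity: average points delivered over the most recent iterations."""
    recent = points_per_iteration[-window:]
    return sum(recent) / len(recent)


def iterations_needed(remaining_points, team_velocity):
    """Whole iterations required to finish the remaining work at this velocity."""
    if team_velocity <= 0:
        raise ValueError("velocity must be positive to project a finish")
    return math.ceil(remaining_points / team_velocity)


def burndown(remaining_points, team_velocity):
    """Projected work remaining at the end of each future iteration."""
    if team_velocity <= 0:
        raise ValueError("velocity must be positive to project a burndown")
    projection = []
    while remaining_points > 0:
        remaining_points = max(0, remaining_points - team_velocity)
        projection.append(remaining_points)
    return projection


# Hypothetical history: points the whole team delivered in each past iteration.
history = [20, 30, 25, 35]
team_velocity = velocity(history)
needed = iterations_needed(100, team_velocity)
projection = burndown(100, team_velocity)
```

Note that nothing here refers to an individual: the input is what the team delivered, and the output is a forecast for planning rather than a score for anyone's past performance.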

The kick-off point for Joel’s most recent essay on the subject is a buzzword-ridden (and just clunkily-worded) cold-call e-mail from a consultant:

Our team is conducting a benchmarking effort to gather an outside-in view on development performance metrics and best practice approaches to issues of process and organization from companies involved in a variety of software development (and systems integration).

It’s a trifle unfair to criticise the consultant for looking at performance metrics, but one has to be careful about what they’re going to be used for.

…but they don’t tell the whole story

A confession. We track one particular metric here in the development team at Guardian Unlimited. And a few days ago we recorded our worst ever figure for this metric since we started measuring it. You could say we had our least productive month ever. You could. But were my management peers in GU unhappy? Was there retribution? No, there was not. In fact there was much popping of champagne corks here, because we all understand that progress isn’t measured by single numbers alone. The celebrations were due to the fact that, with great effort from writers, subs, commercial staff, sponsors, strategists and other technologists we had just launched

A bad month then? Not by a long shot. The numbers do tell us something. They tell me there was a lot of unexpected last-minute running around, and I’ve no doubt we can do things better the next time. It’s something I’ve got to address. But let’s not flog ourselves too much over it — success is about more than just numbers.