Lightweight versus heavyweight: The cost is in the management

A recent conversation with a colleague got me thinking about so-called “lightweight” systems, and when they become more trouble than they’re worth. He was frustrated by some problems he was having; even more so, he explained, because he thought he was dealing with something that was “lightweight”. It’s a seductive word, and sometimes, as with other forms of seduction, when you get more involved than you should things can get a bit sticky.

This article is an attempt to explain what lightweight really means, both in terms of benefits and drawbacks. There are also a couple of comparative examples from my own experience.

[Image: A lightweight system (plus management support)]

Lightweight doesn’t mean simple

People often mistake “lightweight” to mean simple or quick. But this can’t be right, because everyone wants simple and quick, and if it really meant this no-one would use anything else. Every website would be rewritten with the lightweight Ruby on Rails and every application would be sitting on top of the lightweight SQLite database. Who wouldn’t? Who doesn’t want simple? Who doesn’t want quick?

Lightweight is often good, but it must have its tradeoffs, otherwise other technologies wouldn’t exist.

Judging from the examples below, I see lightweight as offering low cost for low demands but high cost for high demands. Heavyweight is the reverse: disproportionately high cost for low demands, but low cost for high demands.

Lightweight carries low inherent management costs. But some situations require a high degree of management control whether you like it or not. That means that if a lightweight system needs to scale up you have to wrest management from it and maintain it externally. If you can do that then the lightweight system continues to work, but if the lightweight system will not relinquish management control, or if you don’t have the discipline to keep the management going, then it won’t be effective in the long run. By contrast heavyweight systems impose management and structure of their own. This is good if you’re going to need it, as it takes the pressure of discipline off you, but it’s not effective if you didn’t need that management structure in the first place.

To illustrate this, here are a couple of lightweight/heavyweight comparison case studies…

Language example: TCL and Java

TCL is a lightweight language. You get to write Hello World in one line, it doesn’t force much structure on you, and it’s pretty relaxed about how it’s written.

TCL is so good, in fact, that it was the basis of the original Guardian Unlimited website. We built Ajax-style tools with it before Ajax was known as a concept, we generated our front page from it, we used it to integrate with our ad server.

But as our site grew the language didn’t scale with it. Clever shortcuts implemented by earlier developers confused newer developers because they obscured the purpose of the code. The lack of an imposed structure meant every foray into older code involved learning its idiosyncrasies from scratch. Development slowed down as we worked around older code. And when we wanted to redesign the website we found that through years of lightweight flexibility we had allowed ourselves to be tied into knots: it would be more effective to start again than to work with what we had.

In fact, for the most part we’re now using Java…

In contrast to TCL, Java is pretty heavyweight. Not only does Hello World require three lines (excluding any lines with just braces), but its philosophy of structure and layering percolates through from the core language to most of its add-ons. For example, to parse an XML document you have to drill through two abstraction layers before you can find the parser.
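
As a rough illustration, assuming the two layers in question are the standard library’s DocumentBuilderFactory and DocumentBuilder, parsing a small XML string looks something like the sketch below. The class name XmlParseSketch and the method rootElementName are invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Illustrative only: the class and method names are invented for this sketch.
class XmlParseSketch {

    // Returns the name of the root element of an XML string.
    static String rootElementName(String xml) {
        try {
            // Layer 1: a factory that knows how to create parsers.
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Layer 2: a builder obtained from the factory.
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Only now do we reach the call that actually parses anything.
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getNodeName();
        } catch (Exception e) {
            // Configuration, I/O and syntax problems all surface as checked exceptions.
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Prints the root element name of a tiny document.
        System.out.println(rootElementName("<article><title>Hello World</title></article>"));
    }
}
```

The indirection is what lets a different parser implementation be swapped in without touching the calling code, which is the structure-first trade the language tends to make.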

One Java framework that maintains this ethos is Hibernate, used for database access. Its architecture is complicated, and as usual this is to offer flexibility without relinquishing manageability. Recently a forthcoming release of the Guardian Unlimited website was failing its pre-production performance tests. Our developers tracked down a major cause of the problem to an inefficient query within Hibernate. They extracted some of the query’s logic up into the application layer and simplified what remained, rebalancing the work between the application and the database. Problem solved, performance restored. What’s relevant to our story is that the developers did this entirely within the architecture of Hibernate, so they didn’t compromise the design of the application and therefore didn’t add complexity.

CMS example: WordPress and the GU CMS

Over on ZDNet Larry Dignan extols the virtues of WordPress and says, effectively, “What have big content management systems ever done for us?”

WordPress is the lightweight CMS I’ve chosen for this blog, and I’m very happy with it. It’s easy to install, requires almost zero maintenance, and lets me focus on the writing. And yet I’m a strong advocate of the home-grown CMS we have for our journalists and subs on the Guardian Unlimited site. Is lightweight not good enough for our journalists? What has a big CMS ever done for us?

Well, I just looked at a current article on guardian.co.uk: “Ministers ordered to assess climate cost of all decisions”. It was created with our big CMS. What’s there that WordPress couldn’t deliver?

For a start it’s got a list of linked subjects down the side, which aren’t the same as WordPress’s tags because they’re tightly managed to ensure consistency and reliability. These subjects are also categorised, so Pollution and Climate Change are subjects under Environment, while Green politics is a subject under Politics. As I write this, I note also that the pages for Pollution and Climate Change are designed differently, with Climate Change being more pictorial and feature-led. Subject categorisations and subject-specific designs are beyond what WordPress’s tags do.
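
One way to picture the difference, as a sketch only: free-form tags are just strings that anyone can type, whereas managed subjects live in a registry that rejects anything not deliberately added and records each subject’s parent category. The SubjectRegistry class below is hypothetical and does not reflect how the GU CMS or WordPress is actually built.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from the GU CMS or WordPress.
class SubjectRegistry {

    // Subject name mapped to its parent category, kept in registration order.
    private final Map<String, String> categoryBySubject = new LinkedHashMap<>();

    // Subjects only enter the vocabulary through this deliberate step.
    void register(String category, String subject) {
        categoryBySubject.put(subject, category);
    }

    // Unlike a free-form tag, an unknown subject is rejected rather than created.
    String categoryOf(String subject) {
        String category = categoryBySubject.get(subject);
        if (category == null) {
            throw new IllegalArgumentException("Unknown subject: " + subject);
        }
        return category;
    }

    // Lists the subjects filed under one category.
    List<String> subjectsIn(String category) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : categoryBySubject.entrySet()) {
            if (entry.getValue().equals(category)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        SubjectRegistry registry = new SubjectRegistry();
        registry.register("Environment", "Pollution");
        registry.register("Environment", "Climate Change");
        registry.register("Politics", "Green politics");
        System.out.println(registry.subjectsIn("Environment"));
        System.out.println(registry.categoryOf("Green politics"));
    }
}
```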

Okay, so apart from the linked subjects, the categorisations, and the subject-specific designs, what has a big CMS ever done for us? I suppose it’s worth mentioning the related advertising, which as I write includes a large ad for environmentally-friendly washing liquid. There are other contextual commercial elements, too, such as the sponsored features, links to green products and books, and offers of reducing energy bills and offsetting carbon emissions. And there are related articles and related galleries. And details of the article history, listing when and where it was first published, on what page and in what newspaper section.

Okay, so apart from the linked subjects, the categorisations, the subject-specific designs, the related advertising, the contextual sponsored features, the links to relevant products and books, the complementary offers, the related articles, the related galleries, and the article history, what has a big CMS ever done for us?

Well, I suppose it is serving over 17 million unique users a month…

I’ll stop now. The point is a lightweight CMS such as WordPress could probably do any one of these things, with a bit of work. But it isn’t designed to do anywhere near all of them. And each time it’s changed to do one more of these things it moves further from its core architecture and closer to a point of paralysis, where nothing functions well anymore because no part of it is doing what it was designed to do. A bit like the TCL example.

Looking back

Reviewing these two examples, it’s clear that the lightweight systems became, or would become, very costly when they were pushed beyond their initial expectations. In both cases the corresponding heavyweight systems came with their own (heavy) management structure, but that management structure ensures lower running costs.

In the Hibernate example our software maintained its architecture after we’d made our performance change; anyone looking at this new code would be able to rely on previous knowledge to understand what was going on. By contrast, anyone coming fresh to a snippet of old TCL code would be starting from scratch, regardless of how much of the other TCL code they’d seen.

Similarly, the large-scale content management system at GU is internally consistent, despite its vast range of features and functionality. Once someone has learnt the principles (which, admittedly, are non-trivial) they can get to work on pretty much any part of it. Pushing WordPress to do that would have created a monster.

Lightweight systems take the management away from you. And that’s ideal, as long as you don’t need that manageability.

Anti-features

Sometimes you can trust too much. Wherever I’ve worked I’ve been involved in a few examples where we listened to the customer, trusted them to know what they wanted, gave it to them, and they regretted it. We have delivered anti-features.

[Image: A lamp with slightly too many features]

Most recently at Guardian Unlimited our (previous) homepage had clever layout rules whose logic sometimes overrode the content that editors entered. The result was that editors were often confused that the page didn’t render according to what they had put in the system. The clever layout rules had been devised in close collaboration with the original editors and graphic designers, the expert users. They reasoned that they didn’t want anyone to unbalance the layout by entering inappropriate combinations of text and images.

But these layout rules had been forgotten over time, hence the confusion years later. Consequently the tech team was often called up regarding a supposed bug (“I’ve put this image in but it doesn’t appear”), and effort was expended only to discover that it was in fact a feature, much to the caller’s amazement. Our micro-lesson there was to give the end users the freedom they naturally expected, including the freedom to decide for themselves what was a balanced layout. If what they produced was unbalanced then the designers would steer them back in the right direction, which is a much more human corrective.

Those clever layout rules were an anti-feature. They were additional functionality that actually made users’ lives worse. Eventually we removed them.

Anti-features happen in highly experienced mega-corporations, too. In November 2006 Joel O’Software started what became known as “the Windows shutdown crapfest”. He compared the three shutdown options of Apple’s OS X with the astonishing nine or more shutdown options of Microsoft’s Windows Vista. Not only is that confusing for the user, but it was also incredibly painful to develop: Moishe Lettvin, a former Microsoft engineer, tells the sorry story.

But at least the Vista shutdown anti-feature made the problem visible. In the layout logic example, and others I know of, there is silent intelligence at work that leads to confusion and frustration without giving the user any visibility at all of what’s going on.

Anti-features are time-consuming to build in, because any features are time-consuming to build in. But anti-features also consume additional time in their removal.

One way to prevent anti-features is to help the end user determine the long-term consequences of what they want. Of course, you’d hope they’d think of that themselves, but you can’t avoid your responsibilities to the project as a whole if you see something that others have missed.

Another way is to adhere even more fervently to the Agile mantras of delivering early (before every last sub-feature is there), keeping it simple, and focusing ruthlessly on only the highest value work. This way we deliver first a front page without clever corrective layout logic, or one or two shutdown options only, and consider further enhancements later if we find we need them. Suggesting and doing this is easier said than done, of course, but if everyone trusts each other to listen honestly to what they have to say then it’s more likely the decisions made will be the best ones.

Meanwhile we can at least ask ourselves each time: Is this a feature or an anti-feature?

There’s nothing so permanent as temporary

[Image: Temporary workaround]

An aphorism I heard recently seems to be particularly memorable: “there’s nothing so permanent as temporary”. However, it wasn’t originally referring to software; it comes from a builder who is rebuilding the kitchen of friends. He’s from Azerbaijan, and my friends are fond of quoting him in full, to give the words maximum colour: “As we say in Azerbaijan, there’s nothing so permanent as temporary”.

It is, however, as relevant to software as it is to building (and probably to many other areas of life). The constant pressure to deliver means there is very rarely the opportunity to go back and improve a nasty historical workaround which is causing problems today. However, there are some strategies we might consider…

1. Automated testing

If the nasty programmatic stickytape is relatively small then a comprehensive suite of automatic tests should enable you to make the change relatively safely and painlessly. Of course, you need to have built up that suite of tests in the first place. And the “relatively small” caveat is important, because only then can the software people fit the replacement activity into their daily work without disrupting any schedules. If there’s no external change then this is an example of refactoring.
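
A minimal sketch of the idea, with everything in it invented for illustration: a small piece of stickytape (a loop that collapses repeated spaces), a tidier replacement, and an automated check that the two agree on a set of sample inputs before the swap is made. A real suite would use a test framework and far more cases.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical example: both methods and the sample data are made up for this sketch.
class RefactoringSafetyNet {

    // The "stickytape": collapses runs of spaces by looping until none remain.
    static String legacyCollapseSpaces(String text) {
        String result = text.trim();
        while (result.contains("  ")) {
            result = result.replace("  ", " ");
        }
        return result;
    }

    // The proposed replacement: same external behaviour, one regular expression.
    static String collapseSpaces(String text) {
        return text.trim().replaceAll(" {2,}", " ");
    }

    // The automated check: the replacement must agree with the old code on every sample.
    static boolean behavesTheSame(List<String> samples) {
        for (String sample : samples) {
            if (!legacyCollapseSpaces(sample).equals(collapseSpaces(sample))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList("", "one", "  two   words ", "a    b  c");
        System.out.println(behavesTheSame(samples) ? "safe to swap" : "behaviour changed");
    }
}
```

Because the check compares outputs rather than implementations, it keeps passing after the old method is deleted only if the visible behaviour is unchanged, which is what makes the change a refactoring.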

If the nasty workaround isn’t so small, but really does need replacing, then a much more concerted effort is needed. A plan needs to be carefully devised which allows slow piecemeal replacement. The point of the plan should be to take baby steps towards the end goal; each step should also leave the system in a stable state, because you never know when you might have to delay implementing any subsequent step. However, if everyone in the team knows the plan then the end goal can be achieved with minimum disruption to the business’s schedules.
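
The shape of such a plan can be sketched in code, purely as an assumption about how one might organise it: features are moved off the workaround one at a time, every feature is always served by exactly one implementation, and the workaround is only deleted once nothing depends on it. The class PiecemealReplacement and the feature names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: the feature names and the "legacy"/"replacement" labels are invented.
class PiecemealReplacement {

    private final List<String> allFeatures;
    // Features already moved off the workaround, in the order they were moved.
    private final Set<String> moved = new LinkedHashSet<>();

    PiecemealReplacement(List<String> allFeatures) {
        this.allFeatures = allFeatures;
    }

    // One baby step: move a single feature. Everything else is untouched,
    // so the system stays usable whether or not the next step ever happens.
    void moveNext() {
        for (String feature : allFeatures) {
            if (!moved.contains(feature)) {
                moved.add(feature);
                return;
            }
        }
    }

    // At every stage each feature is served by exactly one implementation.
    String servedBy(String feature) {
        return moved.contains(feature) ? "replacement" : "legacy";
    }

    // Only when nothing depends on the workaround can it be deleted.
    boolean workaroundCanBeRemoved() {
        return moved.containsAll(allFeatures);
    }

    public static void main(String[] args) {
        PiecemealReplacement plan =
                new PiecemealReplacement(Arrays.asList("search", "front page", "archive"));
        plan.moveNext(); // step one; work could pause here indefinitely
        System.out.println(plan.servedBy("search"));
        System.out.println(plan.servedBy("archive"));
        System.out.println(plan.workaroundCanBeRemoved());
    }
}
```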

2. Act like a trusted professional

I find trust and transparency are an increasing part of the software projects I’m involved with. Part of this is that the technical people want to give their customers options, and are keen to explain the pros and cons, so an informed decision can be made.

However, while this is usually excellent, there can be times when it is too much or undesirable. Stakeholders don’t necessarily want to be given options for every single thing, and sometimes it’s right for a technologist to make an executive decision without referring back — because they’re a professional, and because they are trusted (and employed) to be professional. In this regard it’s sometimes the responsibility of a technologist not even to entertain the possibility of a temporary workaround knowing it will become permanent.

The decisions I’ve been close to in this regard tend to be architecture or technology decisions. For example, in my current work we have, roughly speaking, legacy technologies and modern technologies. Sometimes a feature requirement would arise which was quicker to implement in the legacy technology, but of course we knew it would be more costly in the long run. If the timescales were not wildly divergent then I’d encourage my team to choose the modern technology and not to offer up a choice. These days I need to do that much less, because by building up that base of modern technology it’s now easier to expand and extend that for future features. By making hard decisions in our professional capacity earlier on we’ve made it easier to make the right decisions in the future.

3. Bite the bullet

A third way to remove the temporary workaround is to just be honest about the work involved and the cost to the business. Easier said than done, but an open and frank discussion followed by adjustment and agreement will ensure the issue is examined properly and supported more widely.

One example I remember is a particularly troublesome database table. Originally conceived as an optimisation with a few hundred rows, it grew over time to be a real albatross, slowing both development and production work, and running to over 19 million rows. However, we couldn’t replace it without a lot of effort, because its tentacles were everywhere; we needed to schedule real time for it, and that would mean less time for other, more visible, work.

But when we presented the plan, two things sugared the pill. First, we had long cited the table as the reason for previous work taking longer than anyone would like, so the cost of its presence was already felt tangibly. Second, we ensured our plan was broken down using the “baby steps” approach outlined above — it would take a long time, but it would take out no more than 10% of our resources at any moment, and we could always suspend work for a period if absolutely necessary. The plan won support, was executed, and after several months our DBA typed the SQL “drop” command and we popped a bottle of champagne.

Meanwhile, back in the kitchen…

Of course, all that is assuming the temporary workaround really does need to be replaced. In my friends’ kitchen they are actually quite happy with the drawer dividers they’ve requisitioned as a spicerack. Aside from anything else, it provokes conversation and allows them to talk about their Azerbaijani builder’s many philosophies. He seems to have so many pithy sayings — all of which begin “As we say in Azerbaijan…” — that they’re beginning to suspect he may be making them up as he goes along. Still, if he does have to make a hasty exit from the building trade there may be an opening for him in the software industry.