Adrian Short

Why must Sutton Council's bin software fail when Veolia's servers are busy?

Sutton Council has been having problems with the IT systems that support its waste and recycling contract with Veolia. Online forms for reporting things like missed bin collections haven’t been working, so “customers” who already have a problem with the waste and recycling service itself now have a second problem with the system they need to use to report the first one.

A recent council newsletter for councillors, as tweeted by Tory opposition leader Tim Crowley, is quick to assign the blame to Veolia:

Over the past month, Veolia have been experiencing issues with their Echo system, which has resulted in our online systems timing out for our Customers. This has also affected the ability of our contact centre agents to log calls using the system.

The Echo system is the waste system that Veolia uses across the country to manage their contracts. Veolia have reported the cause of these issues as:

The system not being able to cope with “peak demand requests” from other Council’s (sic) systems, with duplicated/multiple requests being made, resulting in system slow down.

We have been continuously speaking to Veolia to ensure that the issues are addressed as a matter of urgency.

And at first sight this seems plausible. When someone uses Sutton Council’s systems to report a problem, it then gets reported to Veolia. If Veolia’s system isn’t working at that time, the report can’t be taken. The only thing the council can do is to encourage Veolia to sort out the reliability of its systems, because if Veolia’s systems are down then Sutton’s are effectively useless.

Now it should go without saying that Veolia is responsible for its own systems. They need to scale up to handle “peak demand requests” as well as keep working to a reasonable standard the rest of the time. So yes, Veolia needs to get its IT house in order.

But the responsibility doesn’t end there. Why is it essential that Veolia’s systems are working at the precise time that someone makes a report through Sutton’s systems? Sutton is responsible for passing a customer’s report on to Veolia, so Veolia’s system will need to work at some point to receive that report. But where is the necessity for that to happen immediately as the customer makes their report?

Forwarding that report from Sutton to Veolia doesn’t need to happen instantly. Veolia isn’t an emergency service that needs to take life-or-death action in real time. Reports for things like missed bin collections, fly tipping and dog mess on the street can be passed on seconds, minutes, hours or in some cases even days after the customer makes their report without it causing a problem. And once you realise that these reports can be forwarded a reasonable time later, Sutton can restructure its systems so that they work even when Veolia’s aren’t working.

A common systems architecture pattern for this kind of situation is called an asynchronous task queue. It works in a similar way to a person using a todo list. Here’s how it would work in this case:

A customer fills in a form on Sutton Council’s website saying that their bin hasn’t been collected. Sutton’s system doesn’t immediately try to send this report to Veolia. Instead, it adds it to a queue or list on its own systems. The customer is then told that the report has been received and the issue will be fixed in due course.

Separately from the system that takes customer reports online, you have a worker process that goes through the queue and tries to do the tasks in it. This bit of software will work its way down the list and attempt to tell Veolia’s systems about all the problems Sutton’s customers have reported. When a problem is successfully reported, the worker process deletes it from the queue, just like ticking a completed task off a todo list. While the actual underlying work of fixing the bin problem still remains to be done, the worker process has done its part by simply forwarding the problem report to the right place.
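As a rough illustration of the two halves described above, here is a minimal sketch in Python. It assumes the queue is a small SQLite table on the council's own server, and every name in it (`open_queue`, `accept_report`, `run_worker_once`, `send_to_contractor`) is made up for this example. It is not how Sutton's or Veolia's actual systems work.

```python
import sqlite3


def open_queue(path=":memory:"):
    """Create (or open) a local table that acts as the report queue."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS reports ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)"
    )
    return db


def accept_report(db, body):
    """Called by the web form: store the report locally and return at once."""
    db.execute("INSERT INTO reports (body) VALUES (?)", (body,))
    db.commit()
    return "Thanks, your report has been received."


def run_worker_once(db, send_to_contractor):
    """Try to forward each queued report; delete only those that succeed."""
    rows = db.execute("SELECT id, body FROM reports ORDER BY id").fetchall()
    forwarded = 0
    for report_id, body in rows:
        try:
            # Hypothetical call to the contractor's system.
            send_to_contractor(body)
        except Exception:
            # Contractor system down or busy: leave this report in the queue.
            continue
        db.execute("DELETE FROM reports WHERE id = ?", (report_id,))
        db.commit()
        forwarded += 1
    return forwarded


def pending(db):
    """Number of reports still waiting to be forwarded."""
    return db.execute("SELECT COUNT(*) FROM reports").fetchone()[0]
```

The important point is that `accept_report` never talks to the contractor at all, so the customer gets their confirmation whether or not the other system is up.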

So what happens if Veolia’s systems are down or too busy at that time to accept the report? Not a problem. The report just stays in the queue and the worker process will try again later. These retries are typically spaced out so that if there’s an ongoing problem with the other party’s server being overloaded, you don’t run the risk of making things worse. So if retrying after 10 seconds doesn’t work, the worker might wait another minute to try again. If that doesn’t work, it might try five minutes later. If that doesn’t work, it could wait 15 minutes. But these reports will never be forgotten by Sutton’s systems. They’ll stay in the task queue until the worker process is able to forward them successfully. Even if Veolia’s systems are completely down for 24 hours those reports will eventually get through.
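The spaced-out retries could look something like the sketch below. The delays are the ones used as examples in the paragraph above, and the assumption that the worker keeps retrying every 15 minutes after that is mine. The names `forward_until_accepted` and `send` are hypothetical.

```python
import itertools
import time

# Waits taken from the example above: 10 seconds, 1 minute, 5 minutes, 15 minutes.
RETRY_DELAYS = [10, 60, 300, 900]


def retry_delays(schedule=RETRY_DELAYS):
    """Yield each delay in turn, then repeat the last one indefinitely."""
    yield from schedule
    yield from itertools.repeat(schedule[-1])


def forward_until_accepted(report, send, sleep=time.sleep):
    """Keep trying to forward one report, waiting longer after each failure.

    `send` is a hypothetical function that raises an exception when the
    contractor's system is down or busy. Returns the number of attempts made.
    """
    attempts = 0
    delays = retry_delays()
    while True:
        attempts += 1
        try:
            send(report)
            return attempts
        except Exception:
            # Back off before the next attempt so an overloaded server
            # isn't hit with even more traffic.
            sleep(next(delays))
```

A real queue would more likely store a "try again after" time against each report rather than have the worker sit and wait on one report, but the spacing of the attempts is the same idea.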

By structuring its systems defensively, Sutton Council can make its service robust and resilient to slowdowns and shutdowns on Veolia’s servers. There’s no need to create a dependency that both Sutton’s and Veolia’s systems must be working at literally the same time for a customer report to be received. As long as Sutton’s systems are working they can take the reports, store them, and forward them on when Veolia’s systems are working.

The asynchronous task queue is a very common software pattern. Almost every significant web application uses a queue for communications with external systems that need to happen eventually but not immediately. They’re also used for slow processes like sending email and resizing images. When you place an order at an online shop you typically get a confirmation email. But behind the scenes that email isn’t sent the instant you place your order. Instead, the software puts the “send this customer a confirmation email” task in a queue, and a separate worker process goes through that queue and sends those emails out as fast as the mail server can do them, but not immediately. If the mail server is busy and you get that confirmation email five minutes after your order rather than two, it’s not a problem. Off-the-shelf software for creating, managing and processing queues exists for just about every conceivable system. There’s no need to build one from scratch.
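To show how little code the basic idea needs, here is a toy version of the order confirmation example using only Python's standard library `queue` and `threading` modules. The list `sent_emails` stands in for a real mail server, and `place_order` and `email_worker` are names invented for this sketch. A production system would use a queue that survives restarts rather than one held in memory.

```python
import queue
import threading

email_tasks = queue.Queue()
sent_emails = []  # stands in for a real mail server in this sketch


def place_order(order_id, customer_email):
    """Take the order and queue the confirmation email; don't send it here."""
    email_tasks.put((customer_email, f"Order {order_id} confirmed"))
    return order_id


def email_worker():
    """Separate worker: send queued emails one at a time until told to stop."""
    while True:
        task = email_tasks.get()
        if task is None:  # sentinel value meaning "no more work"
            email_tasks.task_done()
            break
        sent_emails.append(task)  # a real worker would talk to a mail server
        email_tasks.task_done()


if __name__ == "__main__":
    worker = threading.Thread(target=email_worker)
    worker.start()
    place_order(101, "customer@example.org")  # returns straight away
    email_tasks.put(None)
    worker.join()
    print(sent_emails)
```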

So if Sutton Council’s systems really are joined at the hip to Veolia’s as the newsletter implies, why didn’t they use a queue to ensure that an outage at Veolia doesn’t effectively bring down Sutton’s systems too? Sutton’s IT integration work with Veolia has so far cost £600,000 and ongoing development work is costing the council £17,000 a week. This isn’t a trivial hobby project. It’s something that’s getting built by very well paid professionals and that needs to work in a real-world environment, at scale, where it’s a given that external partner systems won’t be 100% reliable.

But the bigger problem isn’t just about one system apparently being built in a suboptimal way. It’s about the huge amount of money and other resources being wasted reinventing the wheel as every council does essentially the same thing: building things like “report a missed bin collection” forms and then linking them to either their own systems or their outsourced contractor’s. While some of the specifics will vary from council to council, these processes have a huge amount in common.

If councils were writing open source software, parts or all of those systems could be shared and reused between councils, driving up the quality of the software as it gets improved by a larger group, while also driving down costs. Even where open source software doesn’t directly get reused, developers can simply read other people’s code and see how they’ve solved similar problems. Perhaps they’ve done it well. Perhaps not so well. But it gives an opportunity to learn from other people’s experiences rather than having to work it all out yourself from scratch.

When you’re writing complex systems that people rely on, you need to give yourself every possible advantage in doing it well. Just paying a team of people top dollar on the assumption that they know what they’re doing really doesn’t cut it. Everyone needs to be working in an environment where everyone can learn from everyone else all the time. And that means building software in the open rather than in private. It means sharing code rather than hoarding it. It means looking for common functionality and reusing those components for every similar project rather than writing everything from scratch.

So that’s why I’ve made a Freedom of Information Act request for Sutton Council to disclose its code publicly. If accepted, this won’t in itself give anyone else the legal right to reuse it. But it’s a big step in the right direction towards getting the council to live up to the commitment it has already made by signing the Local Government Digital Service Standard, which says:

Use open standards, existing authoritative data and registers, and where possible make source code and service data open and reusable under appropriate licenses.

17 Mar 2018