The Investigating Software Podcast
Diesel-gate, and the testers who found VW’s $33bn feature.


July 23, 2020

This is the story behind the VW emissions scandal, which has so far cost the company over $33bn. We look into the technology issues VW faced and the investigations that uncovered the problem. Scroll down for the full transcript!

 

Resources used to research and compile this podcast include:

 

Wikipedia: Volkswagen emissions scandal

https://en.wikipedia.org/wiki/Volkswagen_emissions_scandal

 

VW Logo

https://commons.wikimedia.org/wiki/File:Volkswagen_Logo_till_1995.svg

 

IEEE: How They Did It: An Analysis of Emission Defeat Devices in Modern Automobiles

https://ieeexplore.ieee.org/document/7958580

 

Meet the Man Who Brought Down Volkswagen

https://time.com/4119981/the-man-who-brought-down-volkswagen/

 

ICCT: EPA's notice of violation of the Clean Air Act to Volkswagen

https://theicct.org/news/epas-notice-violation-clean-air-act-volkswagen-press-statement

 

WVU: In-Use Emissions Testing of Light-Duty Diesel Vehicles in the United States

https://theicct.org/sites/default/files/publications/WVU_LDDV_in-use_ICCT_Report_Final_may2014.pdf

 

Daniel Lange (DLange), Felix "tmbinc" Domke: The exhaust emissions scandal („Dieselgate“)

https://www.youtube.com/watch?v=d9HJw3AUvGk

 

THE VW NOx EMISSIONS GROUP LITIGATION IN THE HIGH COURT

https://www.judiciary.uk/wp-content/uploads/2020/04/VWJudgment-002.pdf

 

James Liang: Rule 11 Plea Agreement 

https://www.justice.gov/opa/file/890756/download

 

U.S. v. Volkswagen, 16-CR-20394

https://www.justice.gov/usao-edmi/us-v-volkswagen-16-cr-20394

 

Exhausted by Scandal: ‘Dieselgate’ Continues to Haunt Volkswagen

https://knowledge.wharton.upenn.edu/article/volkswagen-diesel-scandal/

 

Faster, Higher, Farther: The Inside Story of the Volkswagen Scandal Kindle Edition

https://smile.amazon.co.uk/gp/product/B01N5VDD2I/

 

Volkswagen Diesel Old Wives' Tale 6 Diesel is Dirty

https://www.youtube.com/watch?v=RMFaBXiBaZA

 

Audi Green Police A3 TDI Ad (Super Bowl XLIV 2010)

https://www.youtube.com/watch?v=GemJWrp0nAM

 

 

Show Transcript:

 

Pete Houghton (00:00):
Hello, and welcome to Investigating Software. In early 2014, the International Council on Clean Transportation asked a group of testers based at West Virginia University to do a study on diesel engine emissions. Their budget was tiny, just 50,000 US dollars. In exchange, they would locate and comprehensively road test emissions from three European cars that are used in the United States. So unlike previous tests, where scientist-testers used the standard lab equipment while the cars were on rollers, they would take these cars out on the road. They tested the three cars on the actual roads around Southern California, and they even took one car on a road trip up to Washington state. That's a journey of almost 4,000 kilometers in total. Our tester-scientists collected this off-cycle road test data and published it on May the 15th, 2014. The results of those relatively cheap road tests would result, as of June 2020, in over 33 billion US dollars in fines, penalties, financial settlements and costs for the Volkswagen Group.

 

Pete Houghton (01:06):
This is the story of VW Dieselgate and the $33 billion feature. Before I get into the technology, software, testing and the like, I'll quickly outline what happened over the decade before the EPA called out VW in September 2015. In 2006, Volkswagen was working on its new diesel engine, the EA 189, to support a new line of diesel cars it hoped would boost its flagging US sales. Historically, Volkswagen had had some success with its Beetle, or Bug as it was affectionately called, as well as the VW camper van. They had been hippie icons back in the seventies, but post-millennium the VW brand hadn't done so well in North America, while other European brands like BMW had made headway at the sort of premium end of the market. So in 2006, the recently promoted chief executive of VW, Martin Winterkorn, devised a plan to grow global sales from around 6 million cars a year up to 10 million.

Pete Houghton (02:00):
It was one of those stretch goals designed to take VW up to the top tier, and it would involve beating its principal rivals for the top spot, like GM and Toyota. The largest car market in the world at this time was the United States; it would be a few years before China overtook the US as the biggest market. So at that time it made perfect sense that any attempt to boost overall sales would involve trying to boost sales in the US. One of VW's biggest problems was a different focus in the US market: while their diesel engines sold well in Europe, especially since the perceived fuel economy of diesel helped on a continent with high fuel prices, the US market had much stricter air quality regulations and, as we will see later, stricter enforcement. In the US market, Toyota led the field in fuel economy with the Prius, an affordable hybrid car from a manufacturer with a reputation for quality.

 

Pete Houghton (02:49):
Now, this is before the negative publicity for Toyota from the sudden unintended acceleration incidents starting in around 2009. So that left VW sort of boxed in: while its diesel engines were relatively efficient, why would people switch from buying a Toyota Prius from a trusted mainstream brand? Also, unlike in Europe, diesel wasn't widely used for domestic vehicles in the US, so they'd need to overcome a natural bias from people who'd only ever bought petrol cars. The solution: deliver an economical and environmentally friendly vehicle under the motivating banner of 'clean diesel'. Here's an example of a Super Bowl advert that epitomizes the marketing strategy they used. Diesel was no longer a reason not to get the car; in fact, you'd be saving the planet.

 

Advert (03:33):
Tragedy strikes tonight where a man has just been arrested for possession of an incandescent light bulb. What do you guys think about plastic bottles now? The water setting is at 105. "Green Police" song... You got a TDI here? Clean Diesel, You're good to go sir! "Green Police" song...

 

Pete Houghton (04:02):
It worked. VW saw a dramatic increase in sales from 2010 through 2012, not just in its diesel cars but also in its petrol cars, and the perception of their cars as clean and economical helped to win over many environmentally concerned customers. In fact, they were still pushing the clean nature of their diesel cars in the second half of 2015. Here's an extract from an advert in their 'Old Wives' Tales' series of adverts.

 

Advert (04:29):
How do you like my new car? Isn't diesel dirty? Say it's beautiful, for Christ's sake. I think it's beautiful, but aren't diesels dirty? Yeah, that's true. Oh, that used to be dirty. This is 2015. No, no, no. Listen to me, Terry, diesel in Latin means dirty. I'll prove it to you. You're going to ruin your scarf. Oh, look what she's doing. See how clean it is.

 

Pete Houghton (04:53):
Interestingly, these adverts were being aired even after the EPA and CARB had privately presented the off-cycle emissions data that they were worried about to VW and asked them to explain what was going on. Just a quick note: EPA stands for the Environmental Protection Agency in the United States, and CARB stands for the California Air Resources Board, a sort of local version of the EPA focused on California. So what are these emissions tests that the EPA and CARB do, and how come the problem with emissions didn't come to light as soon as the cars were first released, or even before then? Well, the car emission testing process is actually quite complex and takes into account the desires of multiple stakeholders. For example, in the US the fuel efficiency and emissions rules have, at various times in their history, taken into account the wishes of the United Auto Workers (that's a labor union in the US) to limit the import of smaller, more efficient foreign cars.

 

Pete Houghton (05:50):
The concerns of the National Highway Traffic Safety Administration, or NHTSA, about the increased risk from accidents involving some smaller cars; the desire of manufacturers to produce a range of vehicles of different sizes, power and class; oh, and obviously the harmful effects of emissions themselves on people and the wider environment. To get a better handle on these not-so-simple emissions rules in more detail, let's inspect that second-to-last rule: the desire of manufacturers to produce a range of different types of vehicles. Let's say you want to get into the car business. You've designed your new car already. It's called the Goliath. It's a big, heavy gas guzzler with all the inefficient extras, but you've got a problem: the Goliath is never going to pass those new emissions standards. So what do you do? Luckily, the emissions rules have your back. You only need to meet the standards on average, across all the cars you sell.

 

Pete Houghton (06:44):
Technically it's what's called a harmonic mean, but it has a similar effect. So the answer is to release a much smaller and more efficient car as well. We'll call it the supermini David. So at the cheap and clean end of the market we have the supermini David, and at the gas-guzzling, expensive end we have the Goliath. Luckily for us, the EPA won't look at the Goliath's emissions in isolation. They'll take into account that you sold a bunch of smaller, cheaper and, more importantly, cleaner and more efficient supermini Davids. That's done on a year-by-year and per-company basis. So each year your company can produce plenty of gas-guzzling Goliath cars, as long as you get enough poor environmentalists to buy your supermini David. But unlike the biblical story that our cars are named after, our supermini Davids are not slaying Goliath; they are literally justifying its existence. Without the clean, efficient car, the inefficient, dirty car would just not be legal.
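
As an aside, here's roughly what that sales-weighted harmonic mean looks like in code. This is a minimal sketch only: the model names, sales volumes and MPG figures are all invented, and the real fleet-average rules involve targets, credits and other details I'm glossing over.

```python
# Sketch of a fleet average computed as a sales-weighted harmonic mean of
# fuel economy. Every figure below is invented for illustration.

def fleet_average_mpg(models):
    """models: list of (units_sold, mpg) tuples for one model year."""
    total_sold = sum(units for units, _ in models)
    # Harmonic mean: total units divided by the sum of units/mpg per model.
    return total_sold / sum(units / mpg for units, mpg in models)

fleet = [
    (50_000, 18.0),   # the "Goliath" gas guzzler
    (150_000, 45.0),  # the "supermini David"
]

print(round(fleet_average_mpg(fleet), 1))  # about 32.7 mpg across the fleet
```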

 

Pete Houghton (07:35):
That's the basic idea behind the CAFE, or Corporate Average Fuel Economy, rules in the US. I've simplified them a bit, but the basic principle is true, and it's not just the US; the EU has a sort of similar system in place. Now, it's not all bad. The rules do tend to tighten up things like fuel economy and emissions slowly over time, and some would argue the increase in car efficiency, or MPG, has been at least in part due to these sorts of regulations. So as you can see, regulating and therefore testing emissions can be complicated. Each of those new models of supermini car has to be certified appropriately to ensure it all balances out overall. And of course the engines are still subject to the laws of thermodynamics, and there's only so much wiggle room a manufacturer has when they develop a new car.

 

Pete Houghton (08:19):
So to find out what the emissions of your new cars actually are, you need testing, and this is where the US and the EU differ. They both have similar standards, as we've seen, and they have similar concepts of the sort of tests that need to be applied to the cars. But in the US the testing is done by the relevant state and federal agencies, like the EPA, or CARB in California; in Europe, your car company can go out and hire a favorable company to certify new vehicles. Europe has historically not had the power to ensure companies actually comply with the regulations, while the EPA in the US has a history of forcing auto makers to comply and fining those that don't. But despite the EPA's greater powers and its history of enforcement, it initially failed to pick up the issues that would later dog VW. It was the nonprofit called the International Council on Clean Transportation, or ICCT, that hired the team of scientist-testers from West Virginia University. A small grant from the ICCT was just enough to enable the team to jury-rig the mobile test equipment they needed and to hire the three vehicles. All three cars were diesel; two were from VW and the third came from BMW. Here's a clip from John German, the guy at the ICCT who hired the testers.

 

John German (09:32):
I work for the International Council on Clean Transportation. We're a nonprofit research organization. We deal primarily with government regulators worldwide, mostly in developing countries; we do have an office in Europe. Um, and we... We've been working on this for at least six years. And we've been trying to fill in the holes. There's a lot of reasons why these emissions might be high.

 

Pete Houghton (09:54):
What was different about these tests was that although they would do the usual mix of suburban, rural, uphill, downhill and highway driving as used in the standardized tests, these tests were on the road, in normal traffic and in normal weather conditions. When you're out on the road, there's a lot more going on. For a start, the road isn't necessarily straight and you have to make the odd turn; the other cars aren't all moving at the same speed; and your journey might not take the same amount of time every time you go out. And also the weather is changeable: for example, the weather up in Seattle was slightly cooler and wetter than the weather they experienced down in Los Angeles when they were testing there. The standardized tests used by the EPA and CARB were the exact opposite. The rules were published in advance, and they had fixed values of speed, time and distance, and standard temperatures and humidity.

 

Pete Houghton (10:41):
So you kind of get the picture about how one was very homogeneous, while in real life things were much more variable, due to just the everyday conditions. Now, it makes sense these tests were published and standardized, at least in part; it would be a bit unfair on the manufacturers if they had no idea what the tests were going to be. I mean, hypothetically, they might engineer a car that was super clean and efficient for suburban use in, say, residential middle America, but unbeknownst to them the test turned out to be all hill climbs in icy conditions. That wouldn't actually be fair. So hence the standards. From a software point of view, these standardized checks would make great unit tests, so you could quickly test the car for serious EPA violations every time you made a change; but of course they wouldn't constitute all of your tests.
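
To make the unit-test analogy concrete, here's a toy sketch. Everything in it is hypothetical: the speed trace is heavily truncated, the limit is a placeholder rather than a real regulatory value, and simulate_nox_g_per_km stands in for whatever emissions model or bench rig a manufacturer would actually use.

```python
# Toy illustration of "the standardized cycle as a unit test". All values and
# the simulate_nox_g_per_km stand-in are hypothetical.

NOX_LIMIT_G_PER_KM = 0.08  # placeholder limit, not a real regulatory figure

def simulate_nox_g_per_km(speed_trace_kmh):
    # Stand-in model: pretend NOx rises with average speed over the cycle.
    return 0.001 * (sum(speed_trace_kmh) / len(speed_trace_kmh))

def test_nox_on_standard_cycle():
    # A heavily truncated, made-up stand-in for a standard speed trace (km/h).
    standard_trace = [0, 10, 25, 40, 55, 60, 45, 30, 15, 0]
    nox = simulate_nox_g_per_km(standard_trace)
    assert nox <= NOX_LIMIT_G_PER_KM, f"NOx {nox:.3f} g/km exceeds the limit"

test_nox_on_standard_cycle()  # passes silently if the check holds
```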

 

Pete Houghton (11:25):
And of course you can't run the car on the rollers and do an emissions test every time you make a code change. The team from West Virginia University are all scientists, and you can tell from seeing their interviews they aren't idealists. They understand they're dealing with machines subject to the laws of thermodynamics, built by imperfect humans and configured to meet the hurdles they expect to see out in the real world. They dutifully performed their on-road and exploratory tests using their jury-rigged equipment. They'd carefully calibrated the test equipment against the standard dynamometer systems back in the workshop; the dynamometer tests are the standard tests done by the EPA and others, and involve the car running on rollers under sort of controlled conditions. What these exploratory road tests showed was that two of the three vehicles exceeded the NOx emission standards: by between 15 and 35 times in one vehicle, and between five and 20 times in the other.

 

Pete Houghton (12:14):
Now, the testers didn't know why this was the case. In fact, they doubted their own results. They repeatedly calibrated their mobile jury-rigged equipment and the cars against the standard dynamometer equipment, and subjected their cars to the standardized tests to check if they still passed. The cars kept passing the standardized tests, but the off-cycle, or on-road, tests showed new, concerning information. The report from West Virginia University is a masterpiece. The results are detailed, comprehensive and full of written information that would help you recreate the same experiments they performed. For example, there are maps of the routes they took, tables showing the speeds the car was going, and even details about the time of day that the journeys took place. They also fully detailed the cars under test, giving details of the engines: power, size, emissions control technology and the particular class of emissions standard the car should belong to. They also recorded the cars' weight, drive type and previous mileage. It's impressive note-taking. I mean, they even detailed the weights of the test equipment they'd loaded into the cars. The report also included the details of the machines used to detect the emissions while the cars were moving. The report also tried to match some of their on-road tests to those included in the EPA standardized tests. For example, the report states this about one of their tests:

 

Report (13:29):
Essentially represents the Los Angeles Route Four, which was ultimately used in developing the original FTP vehicle certification cycle, with some minor modifications at locations where the traffic pattern or roads have changed since the FTP's development.

 

Pete Houghton (13:45):
This is clever. They were not only testing the cars, but they were testing the tests themselves, allowing us to see how the standard dynamometer tests back in the workshop compared with their direct real-world counterparts. The engineers reported their findings: basically, two of the three cars had broken the NOx emission standards by a wide margin when actually used on the roads. The third car had generally done okay and had only gone over the limits on a hill-climbing section of the tests. That was the BMW. An interesting point is that the report doesn't specify the manufacturers of the cars tested. It uses the anonymous terms Vehicle A, Vehicle B and Vehicle C, though if you looked carefully through the document you could deduce that at least one of the cars was a Volkswagen, as there's a reference in the report to some test procedures for the Volkswagen diesel engine, but it isn't mentioned explicitly.

 

Pete Houghton (14:36):
So the team didn't exactly advertise what they worked on. It was pretty low key, and they were pretty clinical about the whole thing. It's this report that provided the initial evidence that something wasn't quite right with the VW engines. Again, at this point I'm sure that some people suspected that the cause was nefarious, but the cause could, in theory, just have been an accident, or a fault that perhaps plagued this particular model of car or these particular versions of those cars. With so few people testing the cars, and testing the cars critically, it's not surprising that the problem had lain hidden for years. Each new car was just subjected to the same standard tests required by the EPA and its Californian counterpart CARB, until the relatively tiny sum of $50,000 (that's less than the sticker price of Vehicle C in the test) was spent by the ICCT to try and test a little better. Only then did we learn something new and important. So what was actually going wrong in their engines? Well, when you drive a modern car, we still use the same controls we've used for decades: steering wheels, pedals and the dials on the dashboard have changed little. But behind the scenes, all those taps on the pedals are now just messages fed into a computer. The computer hears your call for more acceleration, let's say, and allows more fuel and air into the engine. So far,

 

Pete Houghton (15:51):
so good. That's what we want. But the modern diesel engine can't just be left to its own devices. Unfortunately, diesel engines are dirty and produce either soot, hazardous to health, or NOx, which are oxides of nitrogen, also toxic and hazardous to the environment as well. In fact, they will typically produce both, and other pollutants like carbon monoxide as well. So things initially don't look good for diesel in an age of tightening vehicle emissions standards, but luckily there were solutions to all of the above issues. So firstly, the soot. That's just made up of tiny particles, similar to the smoke you get from a garden fire, or even from cigarettes. It's easy enough to create a filter that will clean out most of those particulates. The problem, though, is that like all filters, eventually it will get clogged up and full, and in this type of filter that means it's going to become a lot less able to catch the harmful particles, and they'll just pass straight through and out of the exhaust. But clever engineers have a solution.

 

Pete Houghton (16:41):
They can switch the engine into what's called active regeneration mode. This actually happens automatically when you drive the car faster on a motorway or a freeway, and in effect it burns off the particles, emptying the filter. Unfortunately, though, this active regeneration mode actually increases other pollutants like NOx. So what about these NOx gases? How do we solve that problem? Well, there are a few options here, and I'll run through them. Firstly, there's EGR, or exhaust gas recirculation, a fancy name that just means a percentage of the gas coming out of the exhaust is sent back through the engine to be burnt again. That's great for reducing NOx, but it causes more soot to be produced. Then secondly, there's the LNT, or lean NOx trap. That's like the particle filter I mentioned, but this one is for the toxic NOx gases. It stores the NOx and periodically regenerates itself to get rid of the toxic gas it's already stored.

 

Pete Houghton (17:34):
But this regeneration mode, which is different to the other, filter-based one, produces more soot (see the pattern?). Also, this regeneration isn't very efficient, and as the regenerations happen every few seconds, or at most every few minutes, it has a noticeable hit on fuel economy. Now, the third method is the most effective. It's called selective catalytic reduction, or SCR. This involves a special catalyst, or converter, that converts the toxic NOx gas into water and nitrogen, two things that occur naturally in the air anyway, so they're pretty harmless. A downside is that it requires a tiny amount of an additive to be sprayed into the conversion chamber as the car is being driven. That means we need to squeeze a big tank of this additive into your car somewhere. Also, the additive needs to be topped up every few thousand miles, another chore for the owner that they obviously wanted to avoid.

 

Pete Houghton (18:22):
They could make the tank for the additive bigger and fill it up as part of a routine service, but that means losing some boot space to the bigger tank, and that makes the car look less attractive to prospective buyers. Luckily you, as a driver, don't have to decide about how and when to use these techniques. The car's engine will automatically adjust the settings depending on the cleaning technology installed in your car. The engine's computer, or ECU, constantly reviews the status of the engine, the filters, the speed and the commands you're giving it, balancing the needs of fuel economy, emissions and things like engine wear. What the study by the team at West Virginia University uncovered was that sometimes the balance was just right, when they were doing the standardized tests for example, but sometimes the balance seemed to be way off, like when they did the road tests. But remember, one vehicle, the BMW, didn't do too badly on the road tests.

 

Pete Houghton (19:12):
So it was technically possible to marry the objectives of fuel economy and emissions; it's just that two of the cars didn't seem to be doing that. And a telling sign was that the cleaner BMW used the same emissions technology as used in one of the Volkswagens: they both used the diesel particle filter and the additive-consuming selective catalytic reduction techniques. This suggested the cause of the difference: it was not the emissions-controlling hardware, in theory, but instead the software in the engine that controlled that hardware. So what happened next? Now there was evidence that something, at least, was amiss, the EPA and CARB notified Volkswagen of the anomaly and asked them to look into it. Now, it could just have been a bug. The code and configuration in the engine's control system can have bugs and mistakes, just like any other app, like an app on your phone or on your desktop computer.

 

Pete Houghton (19:58):
But by then CARB had done its own tests of the Volkswagens and seen similar issues. But again, they didn't know the cause. Was it a fault, or was it something more sinister? Later confessions and investigations would show that VW engineers had asked Bosch, who supplied the engine's computer, to use what they labelled an 'acoustic function' when the car was driven in certain ways. From the point you turn the key, the computer would, for example, keep track of the car's speed and compare it to its library of known standardized engine tests. If we plotted these tests on a graph, they would look like narrow corridors in a maze, within which the driver has to keep the car from going either too fast or too slow. The maze has sudden bends and plateaus where the test might simulate the car stopping at a junction and then starting up a few seconds later when the lights have changed. When the car hacker and engineer Felix Domke examined the engine code from the VW computer and overlaid the speeds

 

Pete Houghton (20:51):
he saw in the acoustic function and those of these standard emissions tests, they matched almost perfectly. The acoustic function was what VW had codenamed their defeat device. There's a great IEEE paper where Felix shows the graphs of the data in the computer superimposed on the standard emissions tests; you can see that they almost perfectly match. So what was happening was the defeat device compared the driver's speed and behaviour against a whole selection of standard driving tests, and if it noticed the speed wasn't within the narrowly defined rules in this database, it disabled much of the emissions technology and switched to a dirtier, more polluting mode. Why? Because that dirty mode usually involved greater fuel economy, reduced wear on the exhaust or, in fact, reduced use of those exhaust-cleaning additives. Remember, they didn't want people to use quite so much of the additive, so that owners didn't have to go back to the garage; a chore they think that buyers won't want to do. In late 2015, the EPA issued Volkswagen with its now famous notice of violation, and as the EPA website describes it:

 

EPA (21:54):
The notice alleges that Volkswagen installed software in its model year 2009 to 2015 2.0 liter diesel cars that circumvents EPA emissions standards. These vehicles emit up to 40 times more pollution than emission standards allow.
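
For the software-minded, the cycle-detection idea Domke describes boils down to something like the sketch below. To be clear, this is not the actual ECU code; the speed traces, tolerance and mode names are all invented for illustration.

```python
# Minimal sketch of defeat-device-style cycle detection: compare the speed
# profile since key-on against a library of known test cycles and, if the
# drive strays outside those narrow corridors, fall back to a dirtier
# calibration. All values here are invented.

KNOWN_TEST_CYCLES = {
    # cycle name -> expected speed (km/h) at each sample point since key-on
    "lab_cycle_a": [0, 10, 25, 40, 55, 60, 45, 30, 15, 0],
}
TOLERANCE_KMH = 3.0

def matches_a_test_cycle(observed_speeds):
    for expected in KNOWN_TEST_CYCLES.values():
        pairs = zip(observed_speeds, expected)
        if all(abs(obs - exp) <= TOLERANCE_KMH for obs, exp in pairs):
            return True
    return False

def choose_calibration(observed_speeds):
    # Clean calibration only while the drive still looks like a lab test.
    return "clean_mode" if matches_a_test_cycle(observed_speeds) else "dirty_mode"

print(choose_calibration([0, 11, 24, 41, 54, 61, 44, 31, 14, 0]))   # clean_mode
print(choose_calibration([0, 30, 70, 90, 110, 95, 60, 40, 20, 0]))  # dirty_mode
```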

 

Pete Houghton (22:12):
Over the following weeks, as the inquiries dug deeper, they found issues not only in Volkswagen and Audi vehicles, but also Porsche vehicles. Now, you may ask why Porsche was affected. Porsche is actually a sister company of Volkswagen, along with several others, and the companies share the same parts and technology. For example, the Volkswagen Touareg and the Porsche Cayenne are very similar vehicles and share many body parts and engine components; yes, including the same diesel engine with VW's signature emissions defeat device. As we can see here, while reuse can be great for creating a simpler, more homogeneous production process (that's true in both cars and software), it also means a bug, or feature in this case, can affect much more than just one machine. As the investigations continued, other irregularities also came to light with other companies in the VW group, including Skoda and SEAT. While this wasn't the story of a bug, but rather a feature,

 

Pete Houghton (23:03):
it's also the story of some great testing. Our engineer/tester/scientists at West Virginia University developed cheap, effective tools and applied them in a way that uncovered massive problems that had placed our health and the environment at risk. By spending just a few thousand dollars, the ICCT enabled Dan Carder and his team of students and scientists at WVU to test the cars and compare the results to the standardized tests the industry, and in particular the regulators, had been relying on for years. They built the tools needed to investigate and work the problem of how to get realistic emissions readings that enforced the spirit intended by the regulators. The testers also did clever things like try a variety of vehicles that were either different makes or used different emissions technologies. This let them examine the results of the differences and gave us important clues about where to focus.

 

Pete Houghton (23:55):
But most of all, they took super notes, and with their report detailing every aspect of the investigation, including the steps to recreate it, they presented the results in a clear and readable format that others could reuse. So how else could this have gone down? Let's imagine an alternate history where the same ethical blunders were made, but a wise and profit-focused executive had decided to hire their own testers to do much the same on-road emissions tests. I'll assume a love of people's health, the environment or a desire to be honest did not motivate this executive, but purely profit. This person would likely have found the same results and, being rational, she'd know something was wrong. If they had done this early enough, it could have steered the company onto a cleaner track. Costs of development may have risen, let's say by an extra billion dollars, and there might have been some lost sales.

 

Pete Houghton (24:43):
Now, Forbes puts the cost of developing a whole new model of car at around $6 billion, but we just want to tweak the engine and add a bigger additive tank, so I think a billion would more than cover it. That would still save the company well over $32 billion in fines, penalties and costs. That's not an environmental choice or a public health choice, or even a moral one. That's a clear-cut financial one. For a product of any complexity, the cost of not finding out what's wrong, of not investigating and testing your products, will probably exceed the money you think you're saving by not doing that testing. They were playing Russian roulette with the regulator, and eventually everyone loses that game. The reason I mention this possible alternate history is that back in 2012, a group of engineers noticed that in some customers' vehicles the exhausts were failing too soon. The emissions technology was essentially wearing out. The engineers investigated and ultimately uncovered the cause: a defeat device.

 

Pete Houghton (25:38):
As it turned out, the defeat device didn't always work. Sometimes a driver might end up staying in the defeat mode longer than VW had planned; the car was inadvertently clean. That is, the bug was that the car didn't cheat enough. The engineers actually appeared surprised at the defeat device they'd found, and they convened a meeting with their supervisors and handed over a document outlining what they'd found. I'll quote the Exhibit 2 Statement of Facts from the VW Rule 11 plea agreement, paragraph 48, as to what happened next. Note, this is a statement agreed by VW, not just the US government.

 

Plea Agreement (26:15):
Although they understood the purpose and significance of the software, Supervisors A and E each encouraged the further concealment of the software. Specifically, Supervisors A and E each instructed the engineers who presented the issue to them to destroy the document they had used to illustrate the operation of the defeat device software.

 

Pete Houghton (26:36):
So that illustrates my final point. It's unfortunate Volkswagen made a huge ethical blunder. It's almost worse that when some of their own engineers debugged a serious issue and uncovered the problem, their management hushed it up. If you want to catch bugs, or even inappropriate features, you need a culture that promotes the discovery of problems. It has to be seen as a good thing, and part of the honest practice of engineering and software development. As our heroes at West Virginia University show, it's not particularly expensive, and it can save you a fortune. Thank you. I'm Peter Houghton, and you've been listening to Investigating Software.

 

Therac-25, buggy software that killed.


July 5, 2020

I look at the Therac-25 incidents, a devastating collection of software failures that often rank in the top 10 of civilian radiation accidents. The Therac-25 radiation therapy device killed or injured 6 people across Canada and the United States.

I look into the bugs, why the manufacturer didn't fix them and what we can learn from their mistakes. Scroll down for full transcript!

 

Resources used to research and compile this podcast include:

FATAL RADIATION DOSE IN THERAPY ATTRIBUTED TO COMPUTER MISTAKE
https://timesmachine.nytimes.com/timesmachine/1986/06/21/870086.html?pageNumber=50

 

Radiation Therapy for Cancer 1940s Tumor Treated How it Works
https://www.youtube.com/watch?v=CKjEz-9CbgE

 

FATAL DOSE - Radiation Deaths linked to AECL Computer Errors
http://www.ccnr.org/fatal_dose.html

 

Medical Devices: The Therac-25 by Nancy Leveson
http://csel.eng.ohio-state.edu/productions/pexis/readings/submod3/therac.pdf

 

Wikipedia: Therac-25
https://en.wikipedia.org/wiki/Therac-25

 

FDA document outlining the failure of microwave oven interlocks.
https://www.fda.gov/media/75184/download

 

1.21 Gigawatts - Back to the Future
https://www.youtube.com/watch?v=f-77xulkB_U

 

Hamilton Health Sciences:
https://www.hamiltonhealthsciences.ca/about-us/our-organization/our-history/

 

10 Modern Radiation Accidents Involving Civilians
https://listverse.com/2016/02/05/10-modern-radiation-accidents-involving-civilians/

 

Safety-Critical Computing: Hazards, Practices, Standards, and Regulation
https://staff.washington.edu/jon/pubs/safety-critical.html

 

GOOD COMPUTING: A VIRTUE APPROACH TO COMPUTER ETHICS Chapter 6
http://docplayer.net/33270293-Good-computing-a-virtue-approach-to-computer-ethics.html

 

 

Show Transcript:

Pete Houghton (00:01):
Hello and welcome to Investigating Software. My name is Peter Houghton. It was the 3rd of June, 1985, when Katie Yarborough checked into the Kennestone Regional Oncology Center in Marietta, Georgia. Yarborough was there for follow-up radiation treatment after surgeons had removed a tumor a few months earlier, and she needed treatment on the lymph nodes near her shoulder. Patients typically have little, if any, sensation or sign that the treatment is taking place, and Katie had attended treatment at the center before, so she knew the drill. This time was different. Yarborough screamed in pain and told the machine's operator that he'd burned her shoulder. Later, the hospital's medical physicist determined that she had received up to a hundred times the expected radiation dose in just that one visit to the oncology center. This is the story of the Therac-25, a state-of-the-art radiation therapy device. Katie was its first victim that we know of.

 

Pete Houghton (00:59):
And over the next 18 months, the Therac-25 would kill or seriously injure five more people. At the time, the hospital staff didn't know what had happened, and the full horror of Katie's injuries didn't come to light until weeks later, when the radiation damage to her shoulder became visible. While suffering in constant pain, she eventually lost the use of her whole arm. Katie Yarborough required several skin grafts to fix the soft tissue damage caused by the machine's malfunction. As you can imagine, the incident worried and kind of puzzled the staff at the cancer center; they considered these machines safe and easy to use. Shortly after the incident, Tim Still, the hospital's medical physicist, called AECL Medical, the manufacturer of the machine, for some answers. Tim Still asked if it was possible the machine had malfunctioned and incorrectly spread the beam of radiation. A few days later, Atomic Energy of Canada Limited (AECL) Medical called him back and said it was impossible. Now, this seems a little overconfident, especially as, within a few weeks, it was clear that Yarborough was suffering from radiation burns. But the manufacturer seemed unable to accept that the radiation-producing device that had been used on Katie Yarborough had been the cause. Weirdly, this reminds me of a scene in Back to the Future, when Marty McFly is back in 1955 and needs some plutonium for his time-travelling DeLorean.

 

Marty McFly (02:23):
All we need is a little plutonium.

 

Doc Brown (02:26):
I'm sure that in 1985 plutonium is available in every corner drugstore, but in 1955, it's a little hard to come by.

 

Pete Houghton (02:33):
I mean, how exactly did they think this 61-year-old manicurist would come by severe radiation burns in Georgia in 1985, if not through this machine? Within weeks, in late July 1985, another overdose occurred at the Hamilton Regional Cancer Centre near Toronto in Canada. This time a 40-year-old lady was undergoing treatment following cancer of the cervix. The operator tried five times to treat the patient, each time receiving a message indicating 'no dose' of radiation had been applied. This time, despite the apparently failed treatment, the patient described a burning sensation, and three days later, when the patient returned to hospital, there were clear signs of radiation burns. The hospital took the Therac-25 out of service and brought in technicians from the manufacturer, AECL Medical, to investigate. Now, at this point I was sort of understanding. I figured, you know, it's a new technology; we didn't really know the dangers of what might go wrong. But that's not entirely the case, having looked into it. I know that radiation therapy itself was established practice and had been for decades. For example, here's a nurses' training video from 1945, 40 years before the accidents.

 

Narrator (03:49):
The nurse technician who administers radiotherapy must have a precise knowledge of the equipment which she operates. She should also have an understanding of the physiological effects and psychological aspects of x-ray treatments upon individual patients. These qualifications are essential, since the technician must cooperate very closely with the doctors, nurses and others who are administering the medical and nursing care prescribed for each patient.

 

Pete Houghton (04:17):
So the treatments themselves were not new, but one aspect of the machine was new: it had an all-software control system. The machine no longer used electromechanical interlocks to help control the device; all parts of the device were controlled mainly via computer (for old tech geeks, it was a PDP-11 running a bespoke operating system; I'll come back to that later). Those electromechanical interlocks sound impressive, but they are standard safety devices that you'll even see in your home appliances. For example, when you open your microwave oven mid-nuke, you're using an interlock. An interlock physically shuts down the cooking process to stop you getting cooked as well. You don't get irradiated because you forgot to click cancel or end on the fiddly little buttons; the interlock kicks in and stops the radiation. The Therac-25, our state-of-the-art machine, did away with these old-fashioned physical interlocks

 

Pete Houghton (05:11):
and instead used software to determine if it was safe to start treatment. Earlier devices had used physical interlocks to ensure safe use, so even if the user had mistakenly tried to switch on the radiation too soon, nothing would happen. Yes, as you've probably figured out, these new software interlocks failed to protect patients as well as they could have. Another point to note about this new Therac-25 is its size. It occupies a whole room, plus some space outside the room for the operator, who actually sits at a computer terminal, types in the prescription details and keeps an eye on the patient inside the room. Inside the room there was a moveable table where the patient is placed, and around them is the arm of the device itself. It kind of looks like a giant KitchenAid food mixer. It can be rotated round into the right position and aligned with the patient.

 

Pete Houghton (05:58):
The business end of the machine, sort of like the whisk attachment, is rotated into different positions, and each position determines what the machine is doing: either producing electron radiation, x-rays, or just a simple light so the operator can align the beam before treatment. After the second accident, in Hamilton, Canada, AECL Medical had sent engineers to take a look and see what was wrong, but the engineers never managed to recreate the problem. Nonetheless, AECL Medical suggested a couple of minor code updates that could handle some of the head-positioning problems they guessed might have occurred. They thought that the hardware microswitches that detect the positions of the head (the head that determines whether you're using x-rays or electrons, or whether you're aligning the head) might fail, and so the new code would help the machine handle those failures more gracefully. Then, in a statement that was surely tempting fate, they claimed:

AECL Medical Statement (acted by Pete Houghton) (06:52):
Analysis of the hazard rate of the new solution indicates an improvement over the old system, by at least five orders of magnitude.

 

Pete Houghton (06:59):
They were in essence saying that the machine was now a hundred thousand times as safe. That's quite a statement considering they hadn't been able to recreate the bug in the first place and therefore no evidence that they had fixed anything of consequence. Over the next 18 months the Therac 25 would kill or seriously injured. Four more patients that we know of. This includes a tragic case of a man down in Texas who received a massive overdose to the frontal lobes and died within a month from his injuries. It wasn't until after the second incident at the same site down in Texas, that the FDA declared that for act 25 defective and demanded AECL Medical, come up with a plan to fix it though the machines would remain in use and would kill another patient in early 1987. And now it wasn't until the sixth incident in 1987, that the FDA finally demanded the AECL Medical, tell the hospitals to stop using the device altogether.

 

Pete Houghton (07:56):
So what was going wrong? Was this a case of human error, hardware failure, or problems in the software? The problems here are manifold. There wasn't just one problem, or one bug, or one thing that caused this 18-month-long tale of death and injury. Let's take a look at two of the bugs that were found in the system.

Pete Houghton (08:14):
The race condition bug. To tell the Therac-25 what type of treatment to give the patient, the operators would use a relatively easy-to-use application on the system's computer, though it would seem pretty old-fashioned to us today. The first mainstream desktop that you might recognize, with a mouse, desktop files and windows, was released the same year as the Therac-25, in 1983; Apple fans will of course know the computer was the Apple Lisa. The Therac-25 didn't have one of these new-fangled desktops and used a more traditional set-up called a VT100, which really just consists of text dumped to the screen in columns, where the user can move about the screen using the arrow keys. The medical physicists and operators would just navigate around the screen using the arrow keys and enter the details of the patient's prescription:

 

Pete Houghton (09:06):
for example, the energy level, the type of treatment (so x-ray or electron therapy), the duration of the treatment and other things. To choose x-ray treatment, the user would just enter an X into the appropriate field, and if they wanted to use electron therapy, they would just enter an E into the same field. Behind the scenes, the system would rotate the appropriate parts of the machine into place so the right type of treatment would be given. Unfortunately, during this setup, it wasn't detecting changes that had been made by the operator. So several of the incidents appear to have been caused by the following steps. One: the operator would quickly type in the prescription. Two: they would notice that they had entered an X for x-ray when the patient needed an E for electron; that's an easy mistake to make, because most patients were receiving the x-ray treatment. Three:

 

Pete Houghton (09:59):
they would then correct the mistake, using the up arrow to go back up and enter an E for electron therapy. Four: the operator would then return to the bottom of the screen to command the machine to begin the treatment. Unfortunately, the Therac-25 stopped listening to the new commands for eight seconds after the operator had entered the original X for x-ray. During this eight-second time window, the device was rotating the electromagnets out of the way and the x-ray targets into their correct positions for x-ray treatment. So when our quick-fingered operator updated the system to use the electron therapy mode, the machine ignored her new request. This left the machine in a sort of inconsistent state: half configured for x-ray mode and half configured for electron mode. What's worse, in normal operation x-ray mode automatically sets the system to use maximum power. Normally the patient is protected from most of the radiation by a sort of lens that absorbs a lot of the radiation as it diffuses the beam out into a wide area on the patient's body.
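
For software folks, the shape of that failure is a classic read-then-act race. Here's a simplified sketch in Python; the real Therac-25 logic was PDP-11 assembly, and the class, timings and threading model here are purely illustrative.

```python
# Simplified sketch of the race condition described above. Names and timings
# are invented; only the stale-read pattern is the point.
import threading
import time

class TherapySetupSketch:
    def __init__(self):
        self.requested_mode = "XRAY"   # the operator's first entry
        self.hardware_mode = None      # what the machine is physically set up for

    def configure_hardware(self):
        # Reads the requested mode once, then spends a while moving magnets and
        # targets, ignoring any edits made meanwhile (the ~8 second window).
        mode_at_start = self.requested_mode
        time.sleep(0.5)                      # slow mechanical movement
        self.hardware_mode = mode_at_start   # stale by the time it finishes

    def operator_corrects_prescription(self):
        time.sleep(0.1)                      # a quick-fingered operator edits early
        self.requested_mode = "ELECTRON"

machine = TherapySetupSketch()
setup = threading.Thread(target=machine.configure_hardware)
edit = threading.Thread(target=machine.operator_corrects_prescription)
setup.start(); edit.start(); setup.join(); edit.join()

# Inconsistent state: software says ELECTRON, hardware is set up for XRAY.
print(machine.requested_mode, machine.hardware_mode)
```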

 

Pete Houghton (11:04):
But in this inconsistent state, when the operator started the treatment the patient was hit by the full unshielded power of the electron beam. And it gets worse: the confusing and misleading messages about no dose having been applied meant that the operator sometimes would repeat the process multiple times. So that's the race condition bug: a quick and efficient operator who noticed a mistake in the prescription would quickly fix it, and the Therac-25, unable to handle the changes, ends up delivering a massive overdose of radiation, sometimes a hundred times the required dose. The manufacturer, AECL Medical of Ottawa, Canada, repeatedly refused to accept that the machine had any faults, especially ones with such lethal consequences, and after the third incident, in Yakima, Washington state, they sent the following response:

 

AECL Medical (acted by Pete Houghton) (11:52):
After careful consideration, we are of the opinion that this damage could not have been produced by any malfunction of the Therac-25 or by any operator error.

 

Pete Houghton (12:03):
This is after the hospital staff had pointed out to the company that the red radiation burns on the patient matched the pattern of the open trays at the business end of the Therac-25. After the fifth incident, AECL Medical were informed by the medical physicist Fritz Hager that he had managed to reproduce the error that might have resulted in the patients getting an overdose. So it's interesting here: a user figured out the bug and supplied AECL Medical with the details of how it could both state 'no dose' on the readout while massively overdosing the patient. The user clearly had a good handle on how to use the machine. You might imagine a team of AECL experts quickly looking through the code, developing safe and reliable code patches to this problem and hardware fixes, and ensuring that no one else was placed in danger until a thorough review had been completed.

 

Pete Houghton (12:58):
Yes, that's exactly what didn't happen. Instead, AECL Medical issued an advisory to customers to remove the up arrow key from their keyboards and to cover the metal contacts under the key with electrical tape, just to make sure users didn't click the up arrow and edit the prescription data, thereby avoiding the deadly race condition bug. Even the FDA felt more was needed, and stated on the 2nd of May, 1986:

FDA (acted by Pete Houghton) (13:22):

It does not satisfy the requirements for notification to purchasers of a defect in an electronic product. Specifically, it does not describe the defect nor the hazards associated with it. The letter does not provide any reason for disabling the cursor key, and the tone is not commensurate with the urgency.

 

Pete Houghton (13:40):
Furthermore, the manufacturer didn't stop to think that if the development and testing process had allowed this first bug into the system, maybe other bugs were present and a more thorough approach might be needed.

 

Pete Houghton (13:52):
The second bug I'm going to talk about occurred in Yakima, Washington state, in early 1987; incidentally, it's also thought to be the cause of the earlier incident in Hamilton, near Toronto, that I mentioned earlier. When the operator is getting ready to treat a patient, they often first enter the prescription into the computer, then go into the treatment room to finish aligning the head of the Therac-25 so it points correctly to the tumor or lymph node being treated. In this situation the machine was in what they called a field light mode, and the operator could make continual adjustments until the patient and the business end of the machine were perfectly aligned. During this process, the computer keeps track of the fact that the system isn't ready to use and that the right heads are not in place for treatment. It does this by just increasing a little counter in its memory.

 

Pete Houghton (14:40):
So the computer is in a little loop, going: have I been set up right yet? No. Okay, then add one to my 'not ready' counter. Is the 'not ready' counter zero? No. Okay, I won't fire the huge beam of deadly radiation just yet. So as long as that 'not ready' counter is above zero, the computer doesn't allow the beam to switch on. Which is good and safe. But of course, not in the Therac-25. The problem with increasing a counter on a computer is that eventually it will reach its maximum possible value and will either give an error or return to zero. This is often described as a rollover. It's a bit like how a clock goes back to 00:00 after 23 hours and 59 minutes, at the end of every day. The Therac-25's 'not ready' counter had a maximum value of 255. So while the technician is aligning the machine and the patient, this counter is increasing every time the computer notices that things haven't finished being set up yet, until of course the operator decides everything is in place and wants to proceed to the next step.

 

Pete Houghton (15:44):
The operator then presses 'set', and the computer proceeds to allow them to continue to the next stage of the setup. But of course there is a slim chance, one in 256 in fact, that our 'not ready' counter has just returned to zero. And when the Therac-25 saw it was zero, it essentially saw a green light and applied the radiation beam at full power, even though things weren't properly set up yet. And as the machine wasn't fully set up, there was no diffuser or collimator in place to reduce or shape the beam; the patient was hit with the full power of the beam before the operator had even completed the setup. Those were the two high-profile examples that were highlighted after the incidents, but they weren't all of the problems. The medical physicist Tim Still, the guy I mentioned earlier who worked in Marietta, Georgia, compiled a list of eight other worrying bugs he'd seen in the system.
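
The arithmetic behind that rollover is easy to demonstrate. A minimal sketch; the function name is made up, and the real counter lived in the Therac-25's own code, but the wrap-around behaviour is the same.

```python
# Minimal sketch of a one-byte "not ready" counter rolling over. Only the
# arithmetic is the point; the names are invented.

def bump_not_ready_counter(counter):
    # Incrementing an unsigned 8-bit value wraps 255 back around to 0.
    return (counter + 1) % 256

counter = 0
for _check in range(256):   # 256 "not set up yet" passes through the loop
    counter = bump_not_ready_counter(counter)

# After exactly 256 increments the counter reads 0 again, which the software
# interlock treated as "ready": a one-in-256 chance of firing mid-setup.
print(counter)  # 0
```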

 

Pete Houghton (16:39):
So who, or what, caused these problems? As I mentioned earlier, there are many causes. We could point at the code; clearly that had a lot of bugs. Also the choice of programming language, which was assembly language rather than a language that was easier for colleagues and auditors to review. But that point is almost moot, as there was no external code review; no one outside the company had audited the system before it was deployed into 11 hospitals across the United States and Canada, and the software was all developed by one person. Another issue was the over-reliance on software for safety instead of tried and tested hardware interlocks; also, failing to fully investigate the early problems and then guessing at a fix, which we saw was not the actual issue. They didn't step back and examine the system more deeply, even given there was evidence of harm. There were obviously serious development management issues at AECL Medical.

 

Pete Houghton (17:34):
For example, while developing the software, the developer decided not to use a standard off-the-shelf operating system like Unix or one of the many others that were available. He instead wrote his own operating system. This is like developing a new car radio and then deciding that your radio was so special that you just had to design and build a whole new car to plug your radio into. You can just imagine how reliable that car would be. But the problem, I think, at the heart of all this is a denial by the manufacturer that there was a problem. They repeatedly denied the software could be at fault, even after people were getting radiation burns. They assumed all sorts of other causes for the incidents, accusing one patient of getting burned by her electric blanket and another of getting electrocuted by faulty hospital wiring. Hospital staff thoroughly debunked both of those suggestions. Even AECL Medical's initial fix was based on a guess that the hardware microswitches were failing and the software needed to be amended to handle this.

 

Pete Houghton (18:30):
This is when they had no sign that that had ever happened. So they thought their high-quality software just needed a slight modification to handle the imaginary hardware problem they thought had occurred. I suspect the engineers involved assumed that they could build software just like physical machines; that was their background, after all. All the previous machines had been hardware controlled, with any computers acting purely as calculation machines for the operator. A good point noted by Nancy Leveson, an academic who did much of the research into the failures, is that in a 1983 hazard analysis report, AECL Medical stated:

AECL Medical (acted by Pete Houghton) (19:05):
Two. Program software does not degrade due to wear, fatigue, or reproduction process. Three. Computer execution errors are caused by faulty hardware components and by 'soft' (random) errors induced by alpha particles and electromagnetic noise.

 

Pete Houghton (19:21):
... To that last point: while it's technically true that certain types of radiation can cause errors,

 

Pete Houghton (19:26):
it's accepted that, here on Earth at least, it's much more likely that writing flaky or unreliable code is what's going to be your problem. In my opinion, AECL Medical looked at the problem of determining their system's safety the wrong way around. They assumed that they had used all the right nuts and bolts and that therefore the machine as a whole would be fine. In fact, some of the code had been used before, and no one had died on those systems. The difference was, of course, those machines had hardware safety interlocks and didn't use the software in any fundamental way to control the machine. The manufacturer didn't assume the software was broken; they were optimistic. Whether they were looking at the individual software components or the entirety of the system, they treated them as cogs or nuts and bolts, the assumption being that once you plugged them together right,

 

Pete Houghton (20:10):
they would just work, just like real cogs, or like Lego. Software isn't like that. It's more like an arcane set of rules for an old board game you find in the attic: you might get the gist of what's going on by glancing at the rules, but you won't really know until you play it, and even then you soon realize that you're just not playing it right, and the game is probably a bit rubbish, and you understand why it was placed in the attic in the first place. A better approach is to assume failure: assume that there are serious flaws in the software and that you just need to find them. It's a sort of Pascal's wager. Blaise Pascal was a 17th-century mathematician who claimed it was more rational to believe in the existence of God than not. Pascal's wager went something like this: if you were to believe in God and live accordingly, you'll go to heaven.

 

Pete Houghton (20:54):
But if you didn't believe in God, you wouldn't go to heaven, even if God existed. And conversely, if God didn't exist, you had nothing to lose anyway from believing in God. So, he pointed out, the rational choice is to err on the side of believing in God. So when it comes to reviewing, investigating or testing software, we're trying to find out if the bugs exist. When we take a shallow, unthinking look at the app and see that it's all good and there's nothing to worry about, then we might be right; the bugs might not exist. But the more rational choice is to believe that there is a bug, and spend your time diligently searching for this truth, because the significance of finding one bug far outweighs the significance of many attempts to find no bugs. Finding that one bug that deletes or corrupts your customers' data is far more important than five glances that showed you how marvelous the software was.

 

Pete Houghton (21:40):
It's those bugs that will make people decide not to use your app, your website, or your data model. The next step would be to see the results of your testing, see the bugs, and then take action, not just to fix those bugs but the underlying causes. For the Therac-25, this might include ensuring that the system was fail-safe, having code reviews, and using safety-orientated programming techniques that weren't so prone to dangerous failures. But like I've said before, the first step is knowing you've got a problem, and that's where software testing can help. Thanks. That's all for this episode. Again, I might return to this series of incidents in a future podcast, as there's just so much that went wrong here. I'd also like to thank professor Nancy Leveson, who wrote the initial report that much of the later articles on the Therac-25 failures are based on. It's an excellent insight into what went wrong 35 years ago. I'll put a link to her report in the show notes. Thank you. You've been listening to me, Peter Houghton, and this was Investigating Software.
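As a footnote to that point about fail-safe, safety-orientated code, here is a minimal sketch in Python of a software interlock that fails safe. Every name, field and limit is invented for illustration; it is not the Therac-25's actual design or code.

from dataclasses import dataclass

@dataclass
class MachineState:
    mode: str                # requested mode: "electron" or "xray"
    turntable_position: str  # position reported by the hardware, may be "unknown"
    dose_rate: float         # requested dose rate, arbitrary units

MAX_DOSE_RATE = 100.0        # illustrative limit only

def beam_permitted(state: MachineState) -> bool:
    """Return True only when every check passes; anything odd fails safe (beam off)."""
    if state.mode not in ("electron", "xray"):
        return False
    if state.turntable_position != state.mode:
        return False         # hardware and software disagree: stay off
    if not (0 < state.dose_rate <= MAX_DOSE_RATE):
        return False
    return True

# The caller treats False as "stop and raise an alarm", never "retry silently".
print(beam_permitted(MachineState("xray", "electron", 50.0)))  # False: mismatch
print(beam_permitted(MachineState("xray", "xray", 50.0)))      # True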

 

 

The Post Office Horizon Scandal

The Post Office Horizon Scandal

June 24, 2020

In this episode, we look at the Post Office Horizon scandal, an app that caused what some people are describing as the largest miscarriage of justice in British legal history.

We look at some bugs, the legal judgements and what might have gone wrong at the Post Office to allow things to go so off track. I analyse what we can learn from the disaster to help stop this from happening in our own projects.

Resources used to research and compile this podcast include:

Judgment and related docs:
https://www.judiciary.uk/judgments/bates-others-v-post-office/

The Great Post Office Trial
https://www.bbc.co.uk/sounds/series/m000jf7j

Panorama : Scandal at the Post Office BBC Documentary 2020
https://www.youtube.com/watch?v=d4UYP8JP61A

Post Office worker 'wrongly jailed for £59,000 fraud caused by computer glitch'
https://www.mirror.co.uk/news/uk-news/post-officer-worker-wrongly-jailed-22173274

The Post Office Horizon IT scandal, part 1 – errors and accuracy
https://clarotesting.wordpress.com/2020/05/27/the-post-office-horizon-it-scandal-part-1-errors-and-accuracy/

House of Commons, Chi Onwurah MP
https://parliamentlive.tv/event/index/a7468777-1570-483c-b245-b72c85e08bb7?in=12:45:40

 

Show Transcript:

Peter Houghton (00:00):
Hello. My name is Peter Houghton, and welcome to Investigating Software. This is a story of a buggy software system that ended up with the prosecution of hundreds of its users and pushed many more into bankruptcy, depression, and ruin. It's the story of the Post Office Horizon system. For an idea of the scale of the issues, here's MP Chi Onwurah.

Chi Onwurah MP (00:21):
Mr Speaker, the Post Office Horizon scandal may well be the largest miscarriage of justice in our history. 900 prosecutions, each one its own story: dreams crushed, careers ruined, families destroyed, reputations smashed and lives lost, innocent people bankrupted and imprisoned.

Peter Houghton (00:45):
Horizon is an EPOS, or electronic point of sale, system that's used in every post office across the UK. When you go into a post office to, say, buy some stamps, deposit a benefit cheque or even buy insurance, the Post Office teller is using the Horizon system to make those purchases. It also keeps track of stock and updates the branch accounts for every purchase. In late 2019, the Post Office settled a class action, or group litigation, brought by subpostmasters for 58 million pounds in the High Court in London; that's over 71 million US dollars or 64 million euros. In fact, the costs on all sides were high, with Justice Fraser noting that the sides had spent 27 million pounds in legal fees and expenses. In his words,

Justice Fraser (Acted by Peter Houghton) (01:35):
Both this level and rate of expenditure is very high, even by the standards of commercial litigation between very high value blue chip companies.

Peter Houghton (01:44):
So who are these rebellious subpostmasters? They are the people that run local and often provincial post offices, the sort of post office that doubles as a village shop and would have sold your Sunday newspaper and a pint of milk, birthday cards, that kind of thing. The subpostmasters operated under contract with the Post Office, and so sat in an awkward situation where they were sort of users, and also in a sense customers, of the Horizon system, but technically they're also independent and responsible for their own branch accounts. And it's at that point, the fact that they were responsible for balancing the books but critically didn't have any direct input into or control of the tools, Horizon for example, that the crux of this scandal lies. It wasn't until an earlier trial that Justice Fraser ruled that subpostmasters were not liable for the accounts unless the Post Office could prove that they were at fault.

Peter Houghton (02:36):
So what was the lawsuit all about? Well, the judgment and related documents provide a treasure trove of information regarding what went wrong at the Post Office over the previous 20 years. In 1999, the Post Office rolled out a new point of sale system, Horizon, to all post offices. The system had cost a billion pounds to develop and had been several years in the making. In fact, there were trials of the system as far back as 1995. Soon after the system went live, Fujitsu, the supplier who developed Horizon, and the Post Office received reports of discrepancies and disappearing cash at post offices. I can imagine they sort of expected this. I mean, with any large deployment, especially when it's as complex as this, there's always going to be a few issues that crop up after go-live. What's puzzling is how the Post Office treated some of the people that saw problems with Horizon.

Peter Houghton (03:25):
Now the Post Office isn't like other companies. For a start, it's state-owned, and it has its own investigators and can prosecute people it deems to have committed a crime. In fact, back in the cold war it was involved in the investigation of Russian spies and even took a role in the hunt for the perpetrators of the great train robbery in 1963. When I first heard about the investigators, I had assumed, wrongly, that they were a pretty minor operation, the sort of thing that operates out of a small basement office consisting of a couple of retired police, or maybe a sort of Columbo figure who slowly investigates these odd issues. But in fact, as of 2010, so right in the middle of the Horizon scandal, there were 287 investigators in the Royal Mail Group that would investigate issues across the Post Office and the Royal Mail.

Peter Houghton (04:10):
The Royal Mail is the organization that actually delivers the letters and parcels, unlike the Post Office, which is the shop where you cash cheques or buy stamps. So, like every organization with a lot of hammers, or investigators in this case, the Post Office decided to treat a lot of these bugs as nails. Over the now 20 years since the first deployment of Horizon, hundreds of people have been prosecuted by the Post Office for crimes such as fraud or false accounting. Some ended up in prison, and many have been financially ruined as the Post Office forced them to pay back the money that was allegedly missing from the accounts. The trial judgment provides a fascinating view into the problems with the Horizon system and the arguments made by the Post Office and the subpostmasters, as well as Fujitsu. The documents include a detailed description of the bugs, often including recreation steps and a description of the impact, from the Post Office and the Fujitsu staff who investigated the issues. Here...

Peter Houghton (05:07):
I'm going to outline three of the bugs from the judgment. Now, it's only three because if I did all of them, that'd be 29, and that would probably take too long and test your patience. But these three are quite indicative of the sort of issues people were seeing with the application.

Peter Houghton (05:20):
23. Bureau de Change bug. Currency conversion is available over the counter in many post offices, and as you'd expect, they have a big electronic board on the wall that allows you to see the exchange rates you can get for each currency. Unfortunately, these boards didn't always display the exact same values as those used by the Horizon system itself when you buy currency. The differences were rounding errors and were only present in the fifth or sixth significant figure, but obviously the impact becomes more noticeable with larger purchases of currency. The lack of rigor here is typical of the bugs that are listed in the documents.
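As a hedged illustration of how two parts of a system can disagree in the fifth or sixth significant figure, here's a small Python sketch. The rate, the rounding rule, and the idea that the board rounds while the till doesn't are all invented for the example; this isn't Horizon's actual behaviour.

from decimal import Decimal, ROUND_HALF_UP

# Invented example rate: one part of the system stores it rounded to four
# decimal places, another uses the full value it received from the rate feed.
feed_rate = Decimal("1.234567")                 # GBP to foreign currency
board_rate = feed_rate.quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)

purchase_gbp = Decimal("10000")                 # a large currency purchase
board_amount = purchase_gbp * board_rate        # what the wall board implies
till_amount = purchase_gbp * feed_rate          # what the till calculates

print(board_rate)                  # 1.2346
print(till_amount - board_amount)  # -0.330000, a visible gap on a 10,000 purchase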

Peter Houghton (05:56):
In fact, Justice Fraser states this in his judgment about this bug: "This plainly shows a complacent, if not lackadaisical, attitude to financial precision." The records show that this bug was occurring in 2005, 2006 and 2010 onwards. Interestingly, the Post Office decided this was not a bug, as there was one report of a similar bug that actually turned out to be due to human error, but Justice Fraser notes,

Justice Fraser (Acted by Peter Houghton) (06:27):
The entries above make it clear that there is a bug. The very word chosen by the Fujitsu employee who wrote the two known error logs is 'bug'. To see this characterised in submissions as there not being a bug, and being evidence of human error, is not only puzzling but flies in the face of the terms of the Fujitsu documents. I find that this is evidence of a bug.

Peter Houghton (06:53):
4. Dalmellington bug. This was probably one of the worst bugs and one of the ones that would be most easily caught by a skilled software test engineer. In certain circumstances, an old transaction screen might display after a user had logged off and logged on again. When that happened, the old transaction could be displayed with the enter key enabled by default. A confused or impatient user, and I've heard that those exist, might hit the enter key in an attempt to clear the old transaction away. Depressingly, this didn't clear the screen straight away and actually resulted in the old transaction being repeated. In one case, the subpostmaster hit the button three times, causing a total of 32,000 pounds to get transferred. That's the intended eight grand transfer plus an imaginary 24,000 now missing from the branch accounts. As you can see, this is a cascade of errors on the part of Fujitsu and the Post Office.

Peter Houghton (07:51):
Firstly, the log off and log on process did not correctly clear, or even complete, the first transaction. Secondly, the transaction was displayed again, defaulting to the enter button. Thirdly, the user was able to send a transaction through multiple times by tapping the keys repeatedly. Now, if you think about it, when you use a modern website, even, you know, a non-mainstream website, it's quite common for it to block you from pressing a key multiple times. Even these kinds of small-time sites block that functionality because they know it's a risk, and they don't want to have to chase up these multiple purchases. This bug was present for six years, between 2010 and 2016, and Fujitsu's own records show that between 2010 and 2014 alone there were 93 instances of the bug, and none of these resulted in calls to the support desk by subpostmasters. I suspect that the subpostmasters were either unaware of the bug or were reluctant to call the support line for whatever reason.
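Here's a minimal sketch, in Python, of one common defence against this class of repeated-submit bug: give each transaction an idempotency key so that pressing enter again cannot post it twice. The function and variable names are invented for illustration; this is not how Horizon was built.

import uuid

processed: set[str] = set()   # in a real system this would be persistent storage

def submit_transfer(transaction_id: str, amount: float) -> str:
    """Apply a transfer at most once, however many times the key is pressed."""
    if transaction_id in processed:
        return "ignored duplicate submit"
    processed.add(transaction_id)
    # ... post the transfer of `amount` to the branch accounts here ...
    return f"transferred {amount}"

# The screen generates one id when the transaction is first created,
# so repeated enter presses re-send the same id and are ignored.
tx_id = str(uuid.uuid4())
print(submit_transfer(tx_id, 8000.0))   # transferred 8000.0
print(submit_transfer(tx_id, 8000.0))   # ignored duplicate submit
print(submit_transfer(tx_id, 8000.0))   # ignored duplicate submit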

Peter Houghton (08:48):
3. Suspense account bug. A suspense account is a sort of temporary table used to hold information before it's properly assigned. The Horizon system was in part an accounting system, as well as an EPOS or electronic point of sale, and it had its own local suspense account. Unfortunately, this suffered from a bug. Gareth Jenkins, the lead engineer from Fujitsu, is quoted in the documents as saying,

Gareth Jenkins (Acted by Peter Houghton) (09:14):
The root cause of the problem was that under some specific rare circumstances, some temporary data used in calculating the local suspense was not deleted when it should have been, and so it was erroneously reused a year later.

Peter Houghton (09:27):
This somewhat scary bug was present between 2010 and 2013. Justice Fraser goes on to state,

Justice Fraser (Acted by Peter Houghton) (09:36):
One branch had a loss of approximately 9,800 pounds, some were 161 or less, and another had a gain of 3,100.
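The root cause Gareth Jenkins describes, temporary working data surviving into a later period, is easy to sketch. Here's a hedged Python illustration of that bug pattern and the clean-up step that avoids it; the data model and names are invented, not Fujitsu's.

# Invented, simplified model of a "temporary table" used while producing a figure.
temp_rows = []   # pretend persistent temporary storage shared across years

def calculate_local_suspense(entries, clear_after=True):
    temp_rows.extend(entries)
    total = sum(temp_rows)
    if clear_after:
        temp_rows.clear()        # the missing step in the bug described above
    return total

# Buggy behaviour: last year's working data leaks into this year's figure.
print(calculate_local_suspense([100, -40], clear_after=False))  # 60
print(calculate_local_suspense([10], clear_after=False))        # 70 - wrong

temp_rows.clear()

# Fixed behaviour: working data is deleted once the figure is produced.
print(calculate_local_suspense([100, -40]))                     # 60
print(calculate_local_suspense([10]))                           # 10 - correct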

Peter Houghton (09:46):
I could quite literally go on for hours describing the bugs with Horizon, but I'll stop there, for now at least. I might return in a later podcast; there are just so many bugs. As always, I'll put a link to the judgment and any other relevant facts in the show notes, so you can look these up yourselves.

Peter Houghton (10:06):
As I suggested at the start, hundreds of people have had their lives ruined by a system provided to them by what was once a venerable and trusted service. So I guess the question here is, how did what is essentially a dodgy IT system result in so many people being prosecuted and, ultimately, this huge class action costing the Post Office millions of pounds? Why didn't they just investigate the bugs, fix them and apologize for the inconvenience? In my opinion, from looking through these documents and hearing the victims describe their ordeal, it really comes down to a problem in the culture of the Post Office and Fujitsu. They didn't respect their users, and when users reported the same errors over and over to the support lines, they were told by the Post Office that they were the only people that had seen the problem and that they were at fault.

Peter Houghton (10:51):
And if they didn't make good the missing money, they'd be prosecuted. I don't think the Post Office saw the bug reports and confused subpostmasters' phone calls as valuable feedback. They maybe saw them as an embarrassing mistake, something to be hushed up or punished. This sort of blind allegiance to the ideal of the software isn't that rare. I've worked at a couple of companies where the app, or at least the idea of the ideal working app, gets confused with what the team has actually delivered. One of the best examples of this I've seen is when a large team I worked with hired an external consultancy to do a series of load and performance tests. After finding out that the app couldn't take the load, they started trying various other load tests and different load levels, looking for a working or passing test. Understandably, they were looking to see what the app could handle, what it could do at least. But very quickly they slipped into a mentality where, instead of trying to find the cause of the problems, they were searching for a way to get the load test to prove that the app was able to handle the load, which of course it couldn't. It's like they were thinking, what is wrong with this consultancy?

Peter Houghton (11:56):
Why can't they just show it's a solid app? They're clearly no good. In my opinion, having this kind of blindness, as well as an existing culture and a literal whole department for prosecuting users of the system, made it all too easy to slip into an us-versus-them culture: us good, them bad. As soon as this mindset becomes commonplace, the company stops listening to the information it's getting from the support desk or QA or error logs. The company stalls, and it's just a matter of time before it crashes. In my opinion, it was probably very difficult at the Post Office to stand up and say, there was a bug here, we've got it wrong, let's just fix this. And it probably got a lot harder after dozens of people had been prosecuted. While this is an extreme example, I don't think that kind of attitude is that rare, just in lesser degrees. In software, and maybe in any complex system or bureaucracy, we tend to separate the actual system behavior from what it is meant to do.

Peter Houghton (12:49):
We might think we have a state-of-the-art data processing application, but what we actually have is some lines of code that might not even compile. In software investigation and testing, we describe what the system does. Not what it says on the label, not what we think it should do; we describe what it does do. So how do we avoid this sort of broken mindset? I start by trying to be humble: accept that we've made mistakes, even if we have no evidence of those mistakes yet. They'll come. In fact, a lack of evidence probably indicates a lack of testing, or at least a lack of imagination in those doing the development and testing. This sort of expectation can come from experience, often the sort of painful experience won the hard way, by seeing what you thought was a great application turn out to have many bugs.

Peter Houghton (13:35):
And that can be for people investigating bugs, testing, or the developers themselves. That's not to say it couldn't be taught. We often make the mistake of ignoring the things we got wrong, when actually they're one of the best ways to learn. We can actually spread that knowledge in a company, so not everyone has to go through those hard lessons before we learn from those mistakes. What else? I try to reverse the game. Instead of denying bugs and moaning that they weren't caught earlier, or asking why they weren't caught in unit testing, which genuinely discourages people from owning up to bugs or spending the extra time to find a bug, because, you know, I'm only going to get moaned at, try the opposite approach. Because remember, if we let other people know that these sorts of bugs can happen quite easily, and how they happen, we can help avoid them in future.

Peter Houghton (14:21):
So not everyone has to go through this pain. So try and end up with your team competing to find issues, to find problems. Create a sort of sportsmanlike approach, a gentle, polite arms race, where people are trying to show that the app is more testable, that we've found some cool new bugs, and that we can expect these sorts of failures if we're not careful. Commend your team when they uncover a mistake or a bug or a risk. That way, instead of dreading a bug report and frowning, people say, awesome, stick it on the board, this is great stuff, we can fix that tomorrow. Now, this isn't a magic recipe, but it's a start. It can help break the downward spiral that many teams are stuck in. Becoming more focused on quality won't divert your team from delivering. In fact, quite the reverse: they'll deliver better.

Peter Houghton (15:04):
They'll be more focused. Bugs will be easier to find, easier to fix and easier to test for, and you and your team are going to end up having greater faith that the serious issues have been caught and you can ship. After a while, risk becomes something you look for almost subconsciously. You're better able to allocate your time, you write defensive code by default, you use rigorous debugging techniques, and you're constantly testing the app before and after go-live, as well as talking with customers and product owners. Those are just a few tips that I think maybe the Post Office and other organizations could have used in the early days here, to help avoid a lot of the pain, heartache and financial harm that all sides have come to. Thank you. You've been listening to Investigating Software, and I'm Peter Houghton.

 

Voting Machine Fail

Voting Machine Fail

June 14, 2020

We wind the clock back to November 2019 and investigate the failure of voting machines in Northampton County, Pa., USA. We break down what went wrong, what caused the problem and what we can learn about the risks of software development from this high profile incident.

Resources used to research and compile this podcast include:

A Pennsylvania County’s Election Day Nightmare Underscores Voting Machine Concerns
https://www.nytimes.com/2019/11/30/us/politics/pennsylvania-voting-machines.html

Press conference with Lamont McClure and Adam Carbullido from ES&S on the analysis of the voting machines
https://www.facebook.com/CountyExecutiveLamontMcClure/videos/781532772320093/

Recount underway for all Northampton County races after malfunction in voting machines
https://www.wfmz.com/video/recount-underway-for-all-northampton-county-races-after-malfunction-in-voting-machines/video_fb1bb690-b7ff-53e8-baa3-0a624ddc41a5.html

'Human error' blamed for Northampton County election problems
https://www.wfmz.com/news/area/lehighvalley/human-error-blamed-for-northampton-county-election-problems/article_dce6e71a-1d59-11ea-b530-abf45fbb7cbc.html

Not enough voters detecting ballot errors and potential hacks, study finds
https://news.engin.umich.edu/2020/01/new-study-finds-voters-not-detecting-ballot-errors-potential-hacks/

Northampton County Voting System
https://www.votespa.com/readytovote/Pages/Northampton-County-Voting-System.aspx

 

Show Transcript:

Peter Houghton (00:00):
Hello and welcome to Investigating Software. My name is Peter Houghton. Today, I'm taking you back to the 5th of November, 2019.

TV News Clips (00:08):
Yeah. Election day woes in Northampton County. We haven't heard anything. I know people have called the county and we haven't heard. I'm going to assume that we'll have to go to a paper count at some point. We can update, now that we have heard through the newsroom, our news people confirming that the recount is actually happening.

Peter Houghton (00:25):
We'll look into some bugs that happened with some new voting machines in Northampton County, Pennsylvania. Now, if you don't know where that is, draw a line due west of New York City and a line due north of Philadelphia, and where those lines intersect is Easton, the county seat of Northampton County. Now, Northampton County has a population of 305,000, and on that day it was having an election for a local judge. There's nothing unusual about that; it happens every couple of years. The only new issue was the introduction of some new voting machines, and they appear to have caused some controversy and contention in the collation of the results. The voting machines used that day were supplied by a company called Election Systems and Software, or ES&S, of Omaha, Nebraska. Now, it's a fairly established company; it's been around for 40 years.

Peter Houghton (01:15):
Its ExpressVote XL machines are used all across the country; they have 6,300 of them in use at the moment in the United States. Now, those machines cost the county about $2.8 million, and that sounds like a lot of money. In fact, it is a lot of money to you and me, but that only equates to about 0.6 to 0.7% of Northampton County's annual budget. So a big purchase, but not a huge one by the county's standards. For those of you that haven't seen one of these machines, they sort of resemble a large desktop computer, where they have a large flat screen on the front. It's a touch screen, so voters can just choose who they want to vote for from a list. If it's a simple election with, say, seven candidates, they can just click on one and it'll put a nice tick next to that.

Peter Houghton (02:02):
And then they can proceed to confirm their vote on the paper ballot. Some of the screens can, depending on the election, be a little more complicated and appear in a grid fashion, more like an Excel spreadsheet, where you have rows and columns and within those are the items you need to select. Now, those screens are programmed either by the local election officials or ES&S staff themselves. In this case, the majority of the work was done by ES&S, as the election officials themselves had not used the machines before, and they were newly introduced to the area, so we wouldn't expect them to know how to do that. Here's a clip from Votes PA on how to use the new machines.

Votes PA instructor (02:43):
The ballot will display on the screen. You'll mark your votes by touching anywhere inside the box around your choice. Once selected, your choice will be highlighted in green. After you've finished marking all your choices, don't forget to review every selection before casting your ballot. You can de-select a choice by touching it again.

Peter Houghton (03:00):
Now, the development of this software really takes two parts, two steps. Firstly, there's the actual development of the application itself, working with the hardware, the touch screen. And secondly, there's another stage, also referred to as programming by the election officials and ES&S, which is the configuration of that software for use in a particular election. Obviously each election has different candidates and maybe different rules about who can vote or what they're voting for. So if we go back to our Excel example, there's the development of Excel itself, the application that we buy off the shelf or download, and there's the actual work you do in Excel. Often we do that ourselves, and we may enter scripts or data into it to produce a particular result. So in this case, the development of the application was done prior to the election, and in the days leading up to the election, the programming of the actual candidates and the details of each race that people were voting on that day was configured, or programmed in ES&S terminology, into the system.
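To make that two-stage picture concrete, here's a hedged sketch, in Python, of what the election-specific "programming" might look like as plain data that the already-built application reads. The structure, field names and candidates are all invented for illustration; this is not ES&S's actual format.

# Invented example: the per-election "programming" is just data that the
# application code (built and tested months earlier) walks through and renders.
ballot_config = {
    "contest": "Judge of the Court of Common Pleas",
    "rows": [
        {"type": "instruction", "text": "You may select one candidate."},
        {"type": "candidate", "text": "Candidate A", "party": "DEM"},
        {"type": "candidate", "text": "Candidate B", "party": "REP"},
        {"type": "candidate", "text": "Candidate C", "party": "DEM/REP"},  # cross-filed
    ],
}

# The application just draws the rows; only candidate rows get a selectable box.
for row in ballot_config["rows"]:
    marker = "[ ]" if row["type"] == "candidate" else "   "
    print(marker, row["text"])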

Peter Houghton (04:04):
So there's really two different areas here, in fact more than two if we look from a software investigation or testing point of view, but the two key areas are the initial development, and obviously the testing and verification of that software, and the subsequent configuration, where we take the off-the-shelf software and configure it to make it appropriate for each election. Now, both of those have the potential to produce just what we want, or maybe not quite what we want. They could have bugs or misconfigurations or any number of issues with them. So on the day, the 5th of November, 2019, the leading candidates split along party lines. Now, this is a county that traditionally has a mix of left and right, but historically a slight skew towards the Democratic party. Now, in that election you didn't have to vote along party lines. It was possible to crossfile, that is, choose a candidate that represented both parties. Northampton County election officials requested some instructional text to be placed on the screen to help voters, when they are presented with the screen, to discern how they would vote for one of these crossfiled candidates rather than voting along party lines.

Peter Houghton (05:15):
And it's that instructional text that appears to be the root of the issue seen on election day. Now, what were those issues?

New Speaker (05:21):
When tabulated, the votes got attributed to that instructional text. When we removed the instructional text, as you can see over here, the votes were correctly attributed to the proper candidates.

Peter Houghton (05:35):
One of the candidates, Abe Kassis, received just 164 votes across the whole county when he was expected to get many tens of thousands. This looked a little suspicious, so they actually decided to disregard the automatically collated results from the ES&S systems and go and start looking at the actual paper ballots that are printed each time a user makes a vote. Also, many people were reporting that the screens didn't actually select what they had chosen on the screen. So, for example, someone clicked on a Republican candidate and the Democratic candidate had been highlighted, or vice versa.
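As a hedged illustration of the kind of automated check that could catch that sort of misattribution during configuration testing, here's a short Python sketch. The row structure, names and the check itself are my invention, purely to show the idea; they aren't ES&S's real data model or process.

def validate_tabulation_targets(rows, tabulation_targets):
    """Fail loudly if the tabulator would credit votes to anything but a candidate row."""
    allowed = {row["text"] for row in rows if row["type"] == "candidate"}
    bad = [target for target in tabulation_targets if target not in allowed]
    if bad:
        raise ValueError(f"votes would be attributed to non-candidates: {bad}")

rows = [
    {"type": "instruction", "text": "You may select one candidate."},
    {"type": "candidate", "text": "Candidate A"},
    {"type": "candidate", "text": "Candidate B"},
]

# A misconfiguration where instructional text has become a vote target is
# caught before election day, rather than discovered during the count:
try:
    validate_tabulation_targets(rows, ["Candidate A", "You may select one candidate."])
except ValueError as err:
    print("configuration check failed:", err)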

Peter Houghton (06:12):
Now, it's interesting that the election officials decided to assume that, because people could have verified their vote using the paper ballot on the side of the machine, they actually did do this. Because a few months later, in January 2020, the University of Michigan actually published a report in which it said that 93% of voters missed incorrect information on a ballot. Now, that doesn't prove that the exact same issue in the University of Michigan study happened in Northampton County, but it does make you think that if there was widespread misattribution caused purely by the screens, then maybe that was a more serious issue, something they could have considered more seriously, and maybe re-run the race. They don't seem to join the dots here and think that maybe the fact that the screens were misbehaving, and people were clicking on one candidate while another was being highlighted, might have actually contributed to this issue.

Peter Houghton (07:08):
They seem to want to separate those two bugs. Now, I can see why you'd want to do that. The problem with the electronic storage not correctly tabulating the data is one that's easily remedied by a simple, albeit time-consuming, process of manually reviewing each of the ballots. But the problem with the screens is much harder to rectify. You don't know what the person was actually clicking on. They could have been clicking on a Republican, a Democrat, a crossfiled candidate, or any other part of the screen. So you can't go back and retrospectively fix the data to find the correct answer. Now, there's one thing that wasn't really mentioned in any of the news reports, but if you go back and look at the original recording, which is fairly low quality unfortunately, the ES&S representative actually mentioned that this particular type of configuration hadn't been tested.

Adam Carbullido (07:58):
I want to make clear that this was human error and ES&S takes full accountability. The ballot for Northampton County was untested, and the issue should have been identified by ES&S staff and corrected prior to the election, during election testing.

Peter Houghton (08:15):
So this goes back to the point we made earlier, where there's two stages to the development of this system. The original system is developed back in Nebraska, and it's basically a menuing system with tabulation. And when they come to actually deploy it for a particular county or state, they reconfigure that system in another stage that's sometimes referred to as programming, but it's more of a configuration and onsite testing process. It's that stage that wasn't tested. Or at least it was tested, but not this particular configuration, and hence the issue becomes a problem on the day. Now, this is kind of interesting because it brings to mind Conway's law. For those who don't know, Conway's law was coined by Melvin Conway back in the 1960s, and it goes like this: any organization that designs a system will produce a design whose structure is a copy of the organization's communication structure. And why I think that's relevant here is that it appears there's two teams, right?

Peter Houghton (09:14):
One team does the development and testing back in Nebraska, and there's other people, maybe a different team, that come in later on and configure that for deployment in each state or county election. And because of that, there's this sort of miscommunication between the teams. The two teams have different views of the software, they have different concerns, and in this case they appear to have different understandings of what the software can be used to do and how that's been tested. Now, another interesting point that Lamont McClure, the county executive, states in his press conference is that the backup system worked: the paper ballots allowed for an accurate, confidence-building election to take place. People could look at those paper ballots and verify for themselves, and indeed the election officials did verify that a certain number of votes had been applied to each of the candidates. Now, that's good.

Peter Houghton (10:04):
That was a backup, but you can't rely on a backup, because when you start relying on a backup, you essentially don't have a backup anymore. What you've got then is just how you do things. A good comparison is aircraft. If you've got a passenger plane with 200 passengers and one of your engines fails, you don't keep flying. The pilot doesn't turn around to the crew and go, just chillax, right? Everyone just take it easy, you've got two of these things, we'll just keep going, we'll be there in a few hours. He doesn't. He lands the plane, they do a long, hard investigation into what went wrong, and they try to make sure it doesn't happen again. So I hope that's what ES&S are doing here. Hopefully they've not only fixed this particular issue, allowing for crossfiled candidates or restricting the software from that sort of behavior, they've maybe also gone back and asked, why didn't we find that?

Peter Houghton (10:51):
Okay, we've got these two teams. Maybe they should both have more testers, or developers more suited to this sort of testing, or maybe some new tools or processes they can use to help raise the bar on their quality. Because whatever they were spending before is probably quite small in comparison to the amount of damage done by the bad publicity here. And this is where it comes down to exposure. Now, it's often a mistake to judge the amount of money you're spending on testing your software by how much you're spending per day. So the cost per day might be X thousand dollars, but you're not taking into account the cost of failure. Testing is a sort of insurance against failure. It's not going to catch all the issues, but it's going to help you reduce the number of issues that go live and cause these sorts of embarrassing incidents.

Peter Houghton (11:37):
So when you're comparing the amount you spend on testing, you're really comparing the amount you could make from the software against how much you could lose if you don't have the appropriate testing in place. If one of these issues gets out there and becomes a nationally reported incident on television, or it's in the New York Times, those are the costs you have to look at. Typically, the costs per day are fairly low compared to the sort of ultimate costs you'll see if you don't test your software fully. Of course, what do we mean by fully? Well, it's up to you, right? It depends on your market. Maybe your exposure is low. Maybe you can ship broken software out there and the impact for you, at least in the short term, is quite small. Some companies can do that, particularly startups, but not everyone can.
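As a rough worked example of that comparison, here's a tiny Python sketch. Every figure is invented; the only point is that a per-day testing cost looks very different once you weigh it against the exposure.

# Entirely illustrative numbers, not real budgets or probabilities.
testing_cost_per_day = 1_500
extra_testing_days = 20
cost_of_more_testing = testing_cost_per_day * extra_testing_days  # 30,000

p_public_failure_without_it = 0.10   # a guess at the risk being carried
cost_of_public_failure = 5_000_000   # lost contracts, recounts, reputation

expected_loss = p_public_failure_without_it * cost_of_public_failure
print(cost_of_more_testing, "vs an expected loss of", expected_loss)  # 30000 vs 500000.0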

Peter Houghton (12:21):
And even if you don't actually lose much money directly, for example, you're not shipping a product that people won't buy anymore and missing out on that profit margin, you might have software that other people use. Facebook, for example, has a library that lots of people use, and it isn't used in their typical application's daily run, but if someone were, for example, to want to share something on Facebook, it would use that library. Now, there was a problem a short while ago where people had basically catastrophic failures in their applications; lots of different things were failing, like Spotify, for example, and the reason was that there was a flaw in Facebook's software. Now, that's fine, Facebook didn't lose any money directly, but longer term people are going to start thinking, well, do we have to include that Facebook software in our application? And the same situation is going to happen here.

Peter Houghton (13:05):
They may still keep their short-term contracts for voting machines, but longer term people are going to start thinking, well, maybe, you know, we won't choose that option next time. We'll choose a different company or a different technology. Now, this issue of exposure, and software quality in general, is potentially huge in our society. Let's just take this particular type of application, nothing in particular to do with ES&S, but any company that produces this sort of equipment: what if there is a similar issue in November 2020? Now, if that happens, that could cause civil unrest, and the situation could be worse if, for example, one of the candidates receives only slightly fewer votes than the other candidate and that's due to a particular error. You may not find that error straight away. In this case, the election officials could see that something was wrong, or at least see that there was a problem and that they needed to investigate further.

Peter Houghton (13:57):
If the issue was more subtle, it may not come out straight away. It may come out in an audit later, in which case people are more likely to ascribe a more nefarious cause to it. They're not going to think, oh yeah, the machine failed, let's get it fixed. They're going to think, what have you done? Who did this? Why did they allow this to happen? Why did this particular party win? And what's more troubling is that maybe the actual issue that would cause this sort of slight variation in the results, this sort of minor bug that won't be noticed straight away, is still in the system. These sorts of systems have been tested, you know, back in the office and also, to a certain extent, onsite in regional elections, but that doesn't mean that everything's been tested yet. The low-hanging fruit have already been found; the more subtle issues, ones that are maybe going to come out in the longer term, haven't been found yet. And these are the sorts of things that will come back to bite. These are the things that ultimately raise the cost of your software, not through initial development and testing, but through the long-term liability, the exposure you and your project have. That's all for this podcast. Thank you, and you've been listening to Investigating Software.

 

 
