A counter overflow bug happens when an unsigned integer variable storing a counter reaches its maximum value; when its already at the maximum, as soon as one more value gets added to it, the variable will reset back to 0 and continue counting up from there. If the rest of the software is not expecting this counter reset, it can result in system failures. So the longer the system is running for, the nearer the system will get to a counter overflow event.
Clock drift bugs happen when an internal timer/clock gets calculated incorrectly, causing it to slowly drift out of sync with real time. The longer the system is running, the more the clock drifts and the bigger the error gets. When real time clocks are out of sync, all sorts of downstream calculations can become inaccurate.
Both of these types of bugs typically have a workaround that requires the operator to reboot the system every X hours or days.
Both counter overflows and clock drift bugs can occur in production systems when system testing didn't include a "soak test" - where you keep the system running for a very long period, typically measured in days rather than hours. The failure to pick up the bug during system testing means that a production system gets deployed with a bug, and the only workaround becomes a regular reboot of the system.
The kicker - these type of bugs have been happening in safety critical systems for at least the past 28 years
One of the most publicized cases was the failure of a Patriot missile defense system in 1991, in which the system failed to track an incoming Scud missile, due to a drifting and out of sync clock. Tragically, this particular failure resulted in the death of 28 U.S. soldiers based at a U.S. barracks near to Dhahran, Saudi Arabia. The workaround for this bug was to reboot the system every 8 hours, but unfortunately this had not been communicated to the base in time to avoid the disaster.
Another well known case was the counter overflow in the Boeing 787 Dreamliner firmware (2015), which meant the aircraft had to be rebooted once every 248 days to avoid a counter overflow bug.
The most recent case was an internal timer bug in the Airbus A350 firmware (2019), which requires that the aircraft be rebooted once every 149 hours.
Avoiding bugs in safety critical systems
Avionics systems are among the most complex things ever built by man. There's one leading example of a successful avionics project which was built from scratch targeting high quality and zero defects. Its the Boeing 777 - Boeing's first fly-by-wire aircraft.
For the 777, Boeing decided early on to standardize on the Ada programming language, at a time when using C was the norm. Compilers for Ada had been certified as correct, and the language itself included several safety features, such as design by contract and extremely strong typing.
Ada itself came about as a way to standardize all of the programming languages in use by the U.S. Department of Defense - before Ada they were using some 450 different programming languages. Ada was originally designed to be used for embedded and real time systems.
Ronald Ostrowski, Boeing's director of Engineering, claimed that the Boeing
777 was the most tested airplane of its time. For more than 12 months before its maiden flight, Boeing tested the 777's
avionics and flight-control systems continuously - 24hrs - 7days - in laboratories
simulating millions of flights.
The 777 first entered commercial service with United Airlines on 7th June, 1995. It has since received more orders than any other wide-body airliner. The 777 is one of Boeing's best-selling models; by 2018 it had
become the most-produced Boeing wide-body jet, surpassing the legendary Boeing
747.
Further reading - Boeing flies on 99% Ada
Follow @dodgy_coder
No comments:
Post a Comment