Really well-tested software
Everybody writing software wants it to perform according to its specification and to be reliable. I hope that this is a safe assumption. There are 3 things to be done to achieve these goals: design well; implement carefully; test thoroughly. None of these are optional. A careful blend of all 3 makes sense, but, even then, things can go wrong …
There are various design methodologies that enable a specification to be “translated” into a code/system design. I will not dwell on that now – that is a big subject. Various implementation guidelines are available to assist with the implementation of reliable code – MISRA C is an example that I am familiar with. Testing methodologies tend to boil down to figuring out all the possible use cases and failure modes and verifying that they have been addressed.
Imagine that a device is implemented with these concepts in mind. After thorough testing, which included running the device for a few thousand hours, it was marketed and shipped. For some time, it appeared that all the hard work to get it right had succeeded. Users were happy and the orders were flowing in, with minimal technical queries. Everything was fine until nearly 4 years into the lifetime of the product, when suddenly fault calls started to be received …
What kind of fault could lie dormant for so long and then manifest itself in most, or maybe all, devices? As a software engineer, my first thought is “hardware”. Something has worn out. A 4-year lifetime for a component seems unlikely, but not impossible. Most electronic devices have quite long lifetimes, if their operating parameters [temperature, current etc.] are not exceeded. One easily forgotten exception is flash memory.
Flash is used in many modern embedded devices. There are 3 ways that it is commonly used:
1. Storage of program and data, which may be copied to RAM on startup.
2. Storage of persistent data, like configuration/set-up information, that needs to survive power cycles.
3. Storage of a RAM image when the device uses a hibernate-style power saving mode.
All of these are acceptable implementations. (1) is very unlikely to cause problems, but (2) and (3) can be troublesome. The difficulty is that flash memory has an access restriction: it can be read an indefinite number of times, but will only tolerate a certain, finite number of erase/write cycles. There are various “wear leveling” algorithms that spread the erase/write load across the memory, but, ultimately, flash memory has a predictably limited lifetime.
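To make this concrete, here is a minimal wear-leveling sketch in C. The flash driver functions (flash_erase_sector(), flash_write_word(), flash_read_word()), the sector address, and the sector size are all assumptions for illustration; a real part and driver will differ. Each update is appended to the next free slot in an erased sector, so an erase cycle is consumed only once per thousand or so updates, rather than on every one.

```c
#include <stdint.h>

/* Hypothetical flash driver primitives - names are illustrative only. */
extern void flash_erase_sector(uint32_t sector_addr);
extern void flash_write_word(uint32_t addr, uint32_t value);
extern uint32_t flash_read_word(uint32_t addr);

#define SECTOR_BASE   0x08010000u   /* assumed address of a spare sector */
#define SECTOR_WORDS  1024u         /* assumed sector size in words      */
#define ERASED_WORD   0xFFFFFFFFu   /* value of an erased NOR flash word */

/* Append-style wear leveling: write each update to the next free slot,
   so the whole sector is erased only once every SECTOR_WORDS updates.
   Assumes a stored value never equals the erased-flash pattern. */
void save_value(uint32_t value)
{
    uint32_t slot;

    for (slot = 0u; slot < SECTOR_WORDS; slot++)
    {
        if (flash_read_word(SECTOR_BASE + (slot * 4u)) == ERASED_WORD)
        {
            flash_write_word(SECTOR_BASE + (slot * 4u), value);
            return;
        }
    }

    /* Sector full: erase once and start again at slot 0. */
    flash_erase_sector(SECTOR_BASE);
    flash_write_word(SECTOR_BASE, value);
}

/* The most recent value is in the last non-erased slot. */
uint32_t load_value(void)
{
    uint32_t last = ERASED_WORD;
    uint32_t slot;

    for (slot = 0u; slot < SECTOR_WORDS; slot++)
    {
        uint32_t w = flash_read_word(SECTOR_BASE + (slot * 4u));
        if (w == ERASED_WORD)
        {
            break;
        }
        last = w;
    }
    return last;
}
```

With, say, a 100,000-cycle part and a 1024-word sector, this stretches the budget to around 100 million updates before the sector is worn out.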
Implementation (2) is fine, if the parameters are not changed and stored too often. However, if they are updated too frequently [by design or as a result of an error], failure is inevitable. This is a problem that Tesla reportedly experienced recently. Implementation (3) is trickier, as the frequency of hibernation is part of a power management scheme, but may be user dependent and hard to estimate; some users leave the device switched on, whereas others turn it off when they are done.
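One defense for implementation (2) is to rate-limit flash writes in software. Below is a hedged sketch; hours_since_boot() and flash_store_config() are hypothetical helpers, and the 24-hour interval is just an example of a write budget [roughly 9,000 writes over 25 years]. The point is that a runaway update loop, the kind of error mentioned above, can no longer wear out the device.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helpers - names are illustrative only. */
extern uint32_t hours_since_boot(void);
extern void flash_store_config(const void *data, size_t len);

typedef struct
{
    uint32_t volume;
    uint32_t brightness;
} config_t;

static config_t shadow;               /* last value committed to flash */
static uint32_t last_commit_hours;

#define MIN_COMMIT_INTERVAL_HOURS 24u /* ~365 writes/year: an example budget */

/* Commit settings only when they have actually changed and not more
   often than the write budget allows. A change arriving too soon is
   simply deferred; it is picked up on a later call. */
void config_commit(const config_t *current)
{
    uint32_t now = hours_since_boot();

    if ((memcmp(current, &shadow, sizeof(config_t)) != 0) &&
        ((now - last_commit_hours) >= MIN_COMMIT_INTERVAL_HOURS))
    {
        flash_store_config(current, sizeof(config_t));
        shadow = *current;
        last_commit_hours = now;
    }
}
```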
Are there pure software errors that can result in this kind of “time bomb”? The answer is yes – numerous. Many of them can be circumvented by using coding standards like MISRA C. For example, it may take a long time for heap space to become fragmented, leading to a memory allocation failure. This can be avoided if the advice to avoid dynamic memory allocation is followed.
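The usual alternative, and the direction in which guidelines like MISRA C point, is static allocation. A simple fixed-block pool, sketched below with illustrative sizes, cannot fragment: either a block is available or it is not, and exhaustion shows up immediately in testing rather than years later in the field.

```c
#include <stdint.h>
#include <stddef.h>

/* A fixed-block pool: all storage is allocated statically, so the
   allocator can never fragment. Sizes here are illustrative. */
#define BLOCK_SIZE   32u
#define BLOCK_COUNT  16u

static uint8_t pool[BLOCK_COUNT][BLOCK_SIZE];
static uint8_t in_use[BLOCK_COUNT];

void *pool_alloc(void)
{
    size_t i;

    for (i = 0u; i < BLOCK_COUNT; i++)
    {
        if (in_use[i] == 0u)
        {
            in_use[i] = 1u;
            return pool[i];
        }
    }
    return NULL;    /* exhaustion is immediate and testable, not a time bomb */
}

void pool_free(void *block)
{
    size_t i;

    for (i = 0u; i < BLOCK_COUNT; i++)
    {
        if (block == pool[i])
        {
            in_use[i] = 0u;
            return;
        }
    }
}
```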
Other aspects of code design can lead to trouble. For example, maybe your device needs to log the number of hours since it was first “activated”. If this number were stored in an unsigned 32-bit variable, it would be fine for about half a million years. That seems like overkill. Someone who wanted to save memory might think that a 16-bit variable would suffice. It is quite likely that they would be lazy and use an int, which, being signed, effectively offers only 15 bits for the count. A count that can only reach 32,767 would run out in just under 4 years … [This has really happened.]
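The arithmetic is easy to check, and it is worth doing so for any lifetime counter. A sketch of the safe version, with hypothetical names and the reasoning in comments:

```c
#include <stdint.h>

/* Hours of operation since first activation. The type matters:
     int16_t  : max 32,767 hours       ~ 3.7 years   <- the 4-year fault
     uint16_t : max 65,535 hours       ~ 7.5 years   - still a time bomb
     uint32_t : max 4,294,967,295 hours ~ 490,000 years */
static uint32_t activation_hours;

/* Hypothetical handler, called once per hour of operation. */
void one_hour_tick(void)
{
    activation_hours++;   /* cannot overflow in any plausible product lifetime */
}
```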