Question

Answer the following questions:

  1. What are your moral obligations as professionals to prevent such things from happening?
  2. What guidance do the professional codes provide (cite specific code sections in your discussion)?
  3. A business's governance is supposed to protect the business and its employees from having to make difficult choices like those in the article, but what happens if the governance is weak or nonexistent? What is your responsibility then?
[...] out the life of the system. If any design or operating parameters change after government approval, the AFAR must be updated to include all changes affecting safety. Unfortunately, the Air Force program is not practical for commercial systems. However, government agencies might require manufacturers to provide similar information to users. If required for everyone, competitive pressures to withhold information might be lessened. [...]ing such information actually increases [...] emphasis on safety can be turned into a competitive advantage.

[...] nonmedical systems. We must learn from our mistakes so we do not repeat them.

Acknowledgments

[...]cluded in this article and in reviewing and [...] commented on a draft of the article. Finally, the referees, several of whom were apparently intimately involved in some of the accidents, were also very helpful in providing additional information about the accidents.

10. N.G. Leveson, "Software Safety in Embedded Computer Systems," Comm. [...]

Nancy G. Leveson is Boeing professor of Computer Science and Engineering at the University of Washington. Previously, she was a professor in the Information and Computer Science Department at the University of California, Irvine. Her research interests [...]

An Investigation of the Therac-25 Accidents

Nancy G. Leveson, University of Washington
Clark S. Turner, University of California, Irvine

COMPUTER, p. 18. 0018-9162/93/0700-0018$03.00 © 1993 IEEE

[...] the Therac-25 medical electron accelerator accidents reveals previously unknown details and suggests ways to reduce risk in the future.

Computers are increasingly being introduced into safety-critical systems and, as a consequence, have been involved in accidents. Some of the most widely cited software-related accidents in safety-critical systems involved a computerized radiation therapy machine called the Therac-25. Between June 1985 and January 1987, six known accidents involved massive overdoses by the Therac-25, with resultant deaths and serious injuries. They have been described as the worst series of radiation accidents in the 35-year history of medical accelerators [1].

With information for this article taken from publicly available documents, we present a detailed accident investigation of the factors involved in the overdoses and the attempts by the users, manufacturers, and the US and Canadian governments to deal with them. Our goal is to help others learn from this experience, not to criticize the equipment's manufacturer or anyone else. The mistakes that were made are not unique to this manufacturer but are, unfortunately, fairly common in other safety-critical systems. As Frank Houston of the US Food and Drug Administration (FDA) said, "A significant amount of software for life-critical systems comes from small firms, especially in the medical device industry; firms that fit the profile of those resistant to or uninformed of the principles of either system safety [...]

[...] common belief that any good engineer can build software, regardless of whether he or she is trained in state-of-the-art software-engineering procedures. Many [...] a software-engineering and safety-engineering perspective.

Most accidents are system accidents; that is, they stem from complex interac[...] accident is usually a serious mistake. In this article, we hope to demonstrate the complex nature of accidents and the need to investigate all aspects of system development and operation to understand what has happened and to prevent future accidents.

Despite what can be learned from such investigations, fears of potential liability [...] of software engineering, safety engineering, and government and user standards and oversight.

Genesis of the Therac-25

Medical linear accelerators (linacs) accelerate electrons to create high-energy beams that can destroy tumors with minimal impact on the surrounding healthy tissue. Relatively shallow tissue is treated with the accelerated electrons; to reach deeper tissue, the electron beam is converted into X-ray photons.

In the early 1970s, Atomic Energy of Canada Limited (AECL) and a French company called CGR collaborated to build linear accelerators. (AECL is an arms-length entity, called a crown corporation, of the Canadian government. Since the time of the incidents related in this article, AECL Medical, a division of AECL, is in the process of being privatized and is now called Theratronics International Limited. Currently, AECL's primary business is the design and installation of nuclear reactors.) The products of AECL and CGR's cooperation were (1) the Therac-6, a 6 million electron volt (MeV) accelerator capable of producing X rays only and, later, (2) the Therac-20, a 20-MeV dual-mode (X rays or electrons) accelerator. Both were versions of older CGR machines, the Neptune and Sagittaire, respectively, which were augmented with computer control using a DEC PDP 11 minicomputer.

Software functionality was limited in both machines: The computer merely added convenience to the existing hardware, which was capable of standing alone. Industry-standard hardware safety features and interlocks in the underlying machines were retained. We know that some old Therac-6 software routines were used in the Therac-20 and that CGR developed the initial software.

The business relationship between AECL and CGR faltered after the Therac-20 effort. Citing competitive pressures, the two companies did not renew their cooperative agreement when scheduled in 1981. In the mid-1970s, AECL developed a radical new "double-pass" concept for electron acceleration. A double-pass accelerator needs much less space to develop comparable energy levels because it folds the long physical mechanism required to accelerate the electrons, and it is more economic to produce (since it uses a magnetron rather than a klystron as the energy source).

Using this double-pass concept, AECL designed the Therac-25, a dual-mode linear accelerator that can deliver either photons at 25 MeV or electrons at various energy levels (see Figure 1). Compared with the Therac-20, the Therac-25 is notably more compact, more versatile, and arguably easier to use. The higher energy takes advantage of the phenomenon of "depth dose": As the energy increases, the depth in the body at which maximum dose buildup occurs also increases, sparing the tissue above the target area. Economic advantages also come into play for the customer, since only one machine is required for both treatment modalities (electrons and photons).

Several features of the Therac-25 are important in understanding the accidents. First, like the Therac-6 and the Therac-20, the Therac-25 is controlled by a PDP 11. However, AECL designed the Therac-25 to take advantage of computer control from the outset; AECL did not build on a stand-alone machine. The Therac-6 and Therac-20 had been designed around machines that already had histories of clinical use without computer control.

In addition, the Therac-25 software has more responsibility for maintaining safety than the software in the previous machines. The Therac-20 has independent protective circuits for monitoring electron-beam scanning, plus mechanical interlocks for policing the machine and ensuring safe operation. The Therac-25 relies more on software for these functions. AECL took advantage of the computer's abilities to control and monitor the hardware and decided not to duplicate all the existing hardware safety mechanisms and interlocks. This approach is becoming more common as companies decide that hardware interlocks and backups are not worth the expense, or they put more faith (perhaps misplaced) on software than on hardware reliability.

Finally, some software for the machines was interrelated or reused. In a letter to a Therac-25 user, the AECL quality assurance manager said, "The same Therac-6 package was used by the AECL software people when they started the Therac-25 software. The Therac-20 and Therac-25 software programs were done independently, starting from a common base." Reuse of Therac-6 design features or modules may explain some of the problematic aspects of the Therac-25 software (see the sidebar "Therac-25 software development and design"). The quality assurance manager was apparently unaware that some Therac-20 routines were also used in the Therac-25; this was discovered after a bug related to one of the Therac-25 accidents was found in the Therac-20 software.

AECL produced the first hardwired prototype of the Therac-25 in 1976, and the completely computerized commercial version was available in late 1982. (The sidebars provide details about the machine's design and controlling software, important in understanding the accidents.)

In March 1983, AECL performed a safety analysis on the Therac-25. This analysis was in the form of a fault tree and apparently excluded the software. According to the final report, the analysis made several assumptions:

(1) Programming errors have been reduced by extensive testing on a hardware simulator and under field conditions on teletherapy units. Any residual software errors are not included in the analysis.
(2) Program software does not degrade due to wear, fatigue, or reproduction process. Computer execution errors are caused by faulty hardware components and by "soft" (random) errors induced by alpha particles and electromagnetic noise.

The fault tree resulting from this analysis does appear to include computer failure, although apparently, judging from these assumptions, it considers only hardware failures. For example, in one OR gate leading to the event of getting the wrong energy, a box contains "Computer selects wrong energy" and a probability of 10^-11 is assigned to this event. For "Computer selects wrong mode," a probability of 4 × 10^-9 is given. The report provides no justification of either number.

Accident history

Eleven Therac-25s were installed: five in the US and six in Canada. Six accidents involving massive overdoses to patients occurred between 1985 and 1987. The machine was recalled in 1987 for extensive design changes, including hardware safeguards against software errors.

Related problems were found in the Therac-20 software. These were not recognized until after the Therac-25 accidents because the Therac-20 included hardware safety interlocks and thus no injuries resulted.

In this section, we present a chronological account of the accidents and the responses from the manufacturer, government regulatory agencies, and users.

Kennestone Regional Oncology Center, 1985. Details of this accident in Marietta, Georgia, are sketchy since it was never carefully investigated. There was no admission that the injury was caused by the Therac-25 until long after the occurrence, despite claims by the patient that she had been injured during treatment, the obvious and severe radiation burns the patient suffered, and the suspicions of the radiation physicist involved.

After undergoing a lumpectomy to remove a malignant breast tumor, a 61-year-old woman was receiving follow-up radiation treatment to nearby lymph nodes on a Therac-25 at the Kennestone facility in Marietta. The Therac-25 had been operating at Kennestone for about six months; other Therac-25s [...]

Sidebar: Therac-25 software development and design

We know that the software for the Therac-25 was developed by a single person, using PDP 11 assembly language, over a period of several years. The software "evolved" from the Therac-6 software, which was started in 1972. According to a letter from AECL to the FDA, the "program structure and certain subroutines were carried over to the Therac 25 around 1976."

Apparently, very little software documentation was produced during development. In a 1986 internal FDA memo, a reviewer lamented, "Unfortunately, the AECL response also seems to point out an apparent lack of documentation on software specifications and a software test plan."

The manufacturer said that the hardware and software were "tested and exercised separately or together over many years." In his deposition for one of the lawsuits, the quality assurance manager explained that testing was done in two parts. A "small amount" of software testing was done on a simulator, but most testing was done as a system. It appears that unit and software testing was minimal, with most effort directed at the integrated system test. At a Therac-25 user group meeting, the same quality assurance manager said that the Therac-25 software was tested for 2,700 hours. Under questioning by the users, he clarified this as meaning "2,700 hours of use."

The programmer left AECL in 1986. In a lawsuit connected with one of the accidents, the lawyers were unable to obtain information about the programmer from AECL. In the depositions connected with that case, none of the AECL employees questioned could provide any information about his educational background or experience. Although an attempt was made to obtain a deposition from the programmer, the lawsuit was settled before this was accomplished. We have been unable to learn anything about his background.

AECL claims proprietary rights to its software design. However, from voluminous documentation regarding the accidents, the repairs, and the eventual design changes, we can build a rough picture of it.

The software is responsible for monitoring the machine status, accepting input about the treatment desired, and setting the machine up for this treatment. It turns the beam on in response to an operator command (assuming that certain operational checks on the status of the physical machine are satisfied) and also turns the beam off when treatment is completed, when an operator commands it, or when a malfunction is detected. The operator can print out hard-copy versions of the CRT display or machine setup parameters.

The treatment unit has an interlock system designed to remove power to the unit when there is a hardware malfunction. The computer monitors this interlock system and provides diagnostic messages. Depending on the fault, the computer either prevents a treatment from being started or, if the treatment is in progress, creates a pause or a suspension of the treatment.

The manufacturer describes the Therac-25 software as having a stand-alone, real-time treatment operating system. The system is not built using a standard operating system or executive. Rather, the real-time executive was written especially for the Therac-25 and runs on a 32K PDP 11/23. A preemptive scheduler allocates cycles to the critical and noncritical tasks.

The software, written in PDP 11 assembly language, has four major components: stored data, a scheduler, a set of critical and noncritical tasks, and interrupt services. The stored data includes calibration parameters for the accelerator setup as well as patient-treatment data. The interrupt routines include

- a clock interrupt service routine,
- a scanning interrupt service routine,
- traps (for software overflow and computer-hardware-generated interrupts),
- power up (initiated at power up to initialize the system and pass control to the scheduler),
- treatment console screen interrupt handler,
- treatment console keyboard interrupt handler,
- service printer interrupt handler, and
- service keyboard interrupt handler.

The scheduler controls the sequences of all noninterrupt events and coordinates all concurrent processes. Tasks are initiated every 0.1 second, with the critical tasks executed first and the noncritical tasks executed in any remaining cycle time. Critical tasks include the following:

- The treatment monitor (Treat) directs and monitors patient setup and treatment via eight operating phases. These are called as subroutines, depending on the value of the Tphase control variable. Following the execution of a particular subroutine, Treat reschedules itself. Treat interacts with the keyboard processing task, which handles operator console communication. The prescription data is cross-checked and verified by other tasks (for example, the keyboard processor and the parameter setup sensor) that inform the treatment task of the verification status via shared variables.
- The servo task controls gun emission, dose rate (pulse-repetition frequency), symmetry (beam steering), and machine motions. The servo task also sets up the machine parameters and monitors the beam-tilt-error and the flatness-error interlocks.
- The housekeeper task takes care of system-status interlocks and limit checks, and puts appropriate messages on the CRT display. It decodes some information and [...]

Noncritical tasks include

- Check sum processor (scheduled to run periodically).
- Treatment console keyboard processor (scheduled to run only if it is called by other tasks or by keyboard interrupts). This task acts as the interface between the software and the operator.
- Treatment console screen processor (run periodically). This task lays out appropriate record formats for either displays or hard copies.
- Service keyboard processor (run on demand). This task arbitrates non-treatment-related communication between the therapy system and the operator.
- Snapshot (run periodically by the scheduler). Snapshot captures preselected parameter values and is called by the treatment task at the end of a treatment.
- Hand-control processor (run periodically).
- Calibration processor. This task is responsible for a package of tasks that let the operator examine and change system setup parameters and interlock limits.

It is clear from the AECL documentation on the modifications that the software allows concurrent access to shared memory, that there is no real synchronization aside from data stored in shared variables, and that the "test" and "set" for such variables are not indivisible operations. Race conditions resulting from this implementation of multitasking played an important part in the accidents.

Sidebar: The operator interface

In the main text, we describe changes made as a result of an FDA recall, and here we describe the operator interface of the software version used during the accidents.

The Therac-25 operator controls the machine with a DEC VT100 terminal. In the general case, the operator positions the patient on the treatment table, manually sets the treatment field sizes and gantry rotation, and attaches accessories to the machine. Leaving the treatment room, the operator returns to the VT100 console to enter the patient identification, treatment prescription (including mode, energy level, dose, dose rate, and time), field sizing, gantry rotation, and accessory data. The system then compares the manually set values with those entered at the console. If they match, a "verified" message is displayed and treatment is permitted. If they do not match, treatment is not allowed to proceed until the mismatch is corrected. Figure A shows the screen layout.

When the system was first built, operators complained that it took too long to enter the treatment plan. In response, the manufacturer modified the software before the first unit was installed so that, instead of reentering the data at the keyboard, operators could use a carriage return to merely copy the treatment site data [1]. A quick series of carriage returns would thus complete data entry. This interface modification was to figure in several accidents.

The Therac-25 could shut down in two ways after it detected an error condition. One was a treatment suspend, which required a complete machine reset to restart. The other, not so serious, was a treatment pause, which required only a single-key command to restart the machine. If a treatment pause occurred, the operator could press the "P" key to "proceed" and resume treatment quickly and conveniently. The previous treatment parameters remained in effect, and no reset was required. This convenient and simple feature could be invoked a maximum of five times before the machine automatically suspended treatment and required the operator to perform a system reset.

Error messages provided to the operator were cryptic, and some merely consisted of the word "malfunction" followed by a number from 1 to 64 denoting an analog/digital channel number. According to an FDA memorandum written after one accident:

The operator's manual supplied with the machine does not explain nor even address the malfunction codes. The [Maintenance] Manual lists the various malfunction numbers but gives no explanation. The materials provided give no indication that these malfunctions could place a patient at risk.

The program does not advise the operator if a situation exists wherein the ion chambers used to monitor the patient are saturated, thus are beyond the measurement limits of the instrument. This software package does not appear to contain a safety system to prevent parameters being entered and intermixed that would result in excessive [...]

An operator involved in an overdose accident testified that she had become insensitive to machine malfunctions. Malfunction messages were commonplace; most did not involve patient safety. Service technicians would fix the problems or the hospital physicist would realign the machine and make it operable again. She said, "It was not out of the ordinary for something to stop the machine... It would often give a low dose rate in which you would turn the machine back on... They would give messages of low dose rate, V-tilt, H-tilt, and other things; I can't remember all the reasons it would stop, but there [were] a lot of them." The operator further testified that during instruction she had been taught that there were "so many safety mechanisms" that she understood it was virtually impossible to overdose a patient.

A radiation therapist at another clinic reported that an average of 40 dose-rate malfunctions, attributed to underdoses, occurred on some days.

Reference
1. E. Miller, "The Therac-25 Experience," Proc. Conf. State Radiation Control Program Directors, 1987.

Figure A. Operator interface screen layout.

[...] microswitch codes (which could be caused by a single open-circuit fault on the switch lines) could produce an ambiguous position message for the computer. The problem was exacerbated by the design of the mechanism that extends a plunger to lock the turntable when it is in one of the three cardinal positions: The plunger could be extended when the turntable was way out of position, thus giving a second false position indication. AECL devised a method to indi[...]

Sidebar: Turntable positioning

The Therac-25 turntable design is important in understanding the accidents. The upper turntable (see Figure B) is a rotating table, as the name implies. The turntable rotates accessory equipment into the beam path to produce two therapeutic modes: electron mode and photon mode. A third position (called the field-light position) involves no beam at all; it facilitates correct positioning of the patient.

Proper operation of the Therac-25 is heavily dependent on the turntable position; the accessories appropriate to each mode are physically attached to the turntable. The turntable position is monitored by three microswitches corresponding to the three cardinal turntable positions: electron beam, X ray, and field light. These microswitches are attached to the turntable and are engaged by hardware stops at the appropriate positions. The position of the turntable, sent to the computer as a 3-bit binary signal, is based on which of the three microswitches are depressed by the hardware stops.

The raw, highly concentrated accelerator beam is dangerous to living tissue. In electron therapy, the computer controls the beam energy (from 5 to 25 MeV) and current while scanning magnets spread the beam to a safe, therapeutic concentration. These scanning magnets are mounted on the turntable and moved into proper position by the computer. Similarly, an ion chamber to measure electrons is mounted on the turntable and also moved into position by the computer. In addition, operator-mounted electron trimmers can be used to shape the beam if necessary.

For X-ray therapy, only one energy level is available: 25 MeV. Much greater electron-beam current is required for photon mode (some 100 times greater than that for electron therapy) [1] to produce comparable output. Such a high dose-rate capability is required because a "beam flattener" is used to produce a uniform treatment field. This flattener, which resembles an inverted ice-cream cone, is a very efficient attenuator. To get a reasonable treatment dose rate out, a very high input dose rate is required. If the machine produces a photon beam with the beam flattener not in position, a high output dose rate results. This is the basic hazard of dual-mode machines: If the turntable is in the wrong position, the beam flattener will not be in place.

In the Therac-25, the computer is responsible for positioning the turntable (and for checking turntable position) so that a target, flattening filter, and X-ray ion chamber are directly in the beam path. With the target in the beam path, electron bombardment produces X rays. The X-ray beam is shaped by the flattening filter and measured by the X-ray ion chamber.

No accelerator beam is expected in the field-light position. A stainless steel mirror is placed in the beam path and a light simulates the beam. This lets the operator see precisely where the beam will strike the patient and make necessary adjustments before treatment starts. There is no ion chamber in place at this turntable position, since no beam is expected.

Traditionally, electromechanical interlocks have been used on these types of equipment to ensure safety - in this case, to ensure that the turntable and attached equipment are in the correct position when treatment is started. In the Therac-25, software checks were substituted for many traditional hardware interlocks.

Reference
1. J.A. Rawlinson, "Report on the Therac-25," OCTRF/OCI Physi[...]

Figure B. Upper turntable assembly.

[...] an independent upper collimator positioning interlock on the Therac-25.

Also, in January 1986, AECL received a letter from the attorney representing the Hamilton clinic. The letter said there had been continuing problems with the turntable, including four incidents at Hamilton, and requested the installation of an independent system (potentiometer) to verify turntable position. AECL did not comply: No independent interlock was installed on the Therac-25s at this time.
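The Hamilton clinic's request for an independent check on turntable position can be illustrated with a short sketch. It is not AECL's code: the one-hot bit assignment, the names Position, decode_switches, and beam_permitted, and the idea of passing the independent sensor's reading as an already-decoded position are all assumptions made for illustration. The article says only that the position reaches the computer as a 3-bit signal derived from three microswitches.

```python
from enum import Enum
from typing import Optional


class Position(Enum):
    ELECTRON = "electron"
    XRAY = "xray"
    FIELD_LIGHT = "field_light"


# Hypothetical one-hot encoding, one bit per microswitch.
# The real Therac-25 bit assignment is not given in the article.
SWITCH_CODES = {
    0b001: Position.ELECTRON,
    0b010: Position.XRAY,
    0b100: Position.FIELD_LIGHT,
}


def decode_switches(code: int) -> Optional[Position]:
    """Map a 3-bit switch code to a position; None means ambiguous or invalid."""
    return SWITCH_CODES.get(code)


def beam_permitted(requested: Position, switch_code: int,
                   independent_reading: Optional[Position]) -> bool:
    """Permit the beam only when both sensors agree with the requested mode."""
    if requested is Position.FIELD_LIGHT:
        # No accelerator beam is expected in the field-light position.
        return False
    decoded = decode_switches(switch_code)
    return decoded is requested and independent_reading is requested
```

In this sketch a code with no bits or several bits set is treated as "position unknown" rather than mapped to the nearest valid position, and disagreement between the two sensors blocks the beam. That is the kind of redundancy an independent interlock provides.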
Yakima Valley Memorial Hospital, 1985. As with the Kennestone overdose, machine malfunction in this accident in Yakima, Washington, was not acknowledged until after later accidents were understood.

The Therac-25 at Yakima had been modified in September 1985 in response to the overdose at Hamilton. During December 1985, a woman came in for treatment with the Therac-25. She developed erythema (excessive reddening of the skin) in a parallel striped pattern at one port site (her right hip) after one of the treatments. Despite this, she continued to be treated by the Therac-25 because the cause of her reaction was not determined to be abnormal until January or February of 1986. On January 6, 1986, her treatments were completed.

The staff monitored the skin reaction closely and attempted to find possible causes. The open slots in the blocking trays in the Therac-25 could have produced such a striped pattern, but by the time the skin reaction had been determined to be abnormal, the blocking trays had been discarded. The blocking arrangement and tray striping orientation could not be reproduced. A reaction to chemotherapy was ruled out because that should have produced reactions at the other ports and would not have produced stripes. When it was discovered that the woman slept with a heating pad, a possible explanation was offered on the basis of the parallel wires that deliver the heat in such pads. The staff x-rayed the heating pad and discovered that the wire pattern did not correspond to the erythema pattern on the patient's hip.

The hospital staff sent a letter to AECL on January 31, and they also spoke on the phone with the AECL technical support supervisor. On February 24, 1986, the AECL technical sup[...]

[...] fix; she merely used the cursor up key to edit the mode entry. Since the other parameters she had entered were correct, she hit the return key several times and left their values unchanged.
She reached the bottom of the screen where a message indicated that the parameters had been "verified" and the terminal displayed "beam ready," as expected. She hit the one-key command "B" (for "beam on") to begin the treatment. After a moment, the machine shut down and the console displayed the message "Malfunction 54." The machine also displayed a "treatment pause," indicating a problem of low priority (see the operator interface sidebar). The sheet on the side of the machine explained that this malfunction was a "dose input 2" error. The ETCC did not have any other information available in its instruction manual or other Therac-25 documentation to explain the meaning of Malfunction 54. An AECL technician later testified that "dose input 2" meant that a dose had been delivered that was either too high or too low.

The machine showed a substantial underdose on its dose monitor display: 6 monitor units delivered, whereas the operator had requested 202 monitor units. The operator was accustomed to the quirks of the machine, which would frequently stop or delay treatment. In the past, the only consequences had been inconvenience. She immediately took the normal action when the machine merely paused, which was to hit the "P" key to proceed with the treatment. The machine promptly shut down with the same "Malfunction 54" error and the same underdose shown by the display terminal.

The operator was isolated from the patient, since the machine apparatus was inside a shielded room of its own. The only way the operator could be alerted to patient difficulty was through audio and video monitors. On this day, the video display was unplugged and the audio monitor was broken.

After the first attempt to treat him, the patient said that he felt like he had received an electric shock or that someone had poured hot coffee on his back: He felt a thump and heat and heard a buzzing sound from the equipment. Since this was his ninth treatment, he knew that this was not normal. He began to get up from the treatment table to go for
He began to get up from the treatment table to go for help. It was at this moment that the operator hit the "P" key to proceed with the treatment. The patient said that he felt like his arm was being shocked by electricity and that his hand was leaving his body. He went to the treatment room door and pounded on it. The operator was shocked and immediately opened the door for him. He appeared shaken and upset.

The patient was immediately examined by a physician, who observed intense erythema over the treatment area, but suspected nothing more serious than electric shock. The patient was discharged with instructions to return if he suffered any further reactions. The hospital physicist was called in, and he found the machine calibration within specifications. The meaning of the malfunction message was not understood. The machine was then used to treat patients for the rest of the day.

In actuality, but unknown to anyone at that time, the patient had received a massive overdose, concentrated in the center of the treatment area. After-the-fact simulations of the accident revealed possible doses of 16,500 to 25,000 rads in less than 1 second over an area of about 1 cm.

During the weeks following the accident, the patient continued to have pain in his neck and shoulder. He lost the function of his left arm and had periodic bouts of nausea and vomiting. He was eventually hospitalized for radiation-induced myelitis of the cervical cord causing paralysis of his left arm and both legs, left vocal cord paralysis (which left him unable to speak), neurogenic bowel and bladder, and paralysis of the left diaphragm. He also had a lesion on his left lung and recurrent herpes simplex skin infections. He died from complications of the overdose five months after the accident.

User and manufacturer response. The Therac-25 was shut down for testing the day after this accident. One local AECL engineer and one from the home office in Canada came to ETCC to investigate. They spent a day running the machine through tests but could not reproduce a Malfunction 54. The AECL home office engineer reportedly explained that it was not possible for the Therac-25 to overdose a patient. The ETCC physicist claims that he asked AECL at this time if there were any other reports of radiation overexposure and that the AECL personnel (including the quality assurance manager) told him that AECL knew of no accidents involving radiation overexposure by the Therac-25. This seems odd since AECL was surely at least aware of the Hamilton accident that had occurred seven months before and the Yakima accident, and, even by its own account, AECL learned of the Georgia lawsuit about this time (the suit had been filed four months earlier). The AECL engineers then suggested that an electrical problem might have caused this accident.

The electric shock theory was checked out thoroughly by an independent engineering firm. The final report indicated that there was no electrical grounding problem in the machine, and it did not appear capable of giving a patient an electrical shock. The ETCC physicist checked the calibration of the Therac-25 and found it to be satisfactory. The center put the machine back into service on April 7, 1986, convinced that it was performing properly.

East Texas Cancer Center, April 1986. Three weeks after the first ETCC accident, on Friday, April 11, 1986, another male patient was scheduled to receive an electron treatment at ETCC for a skin cancer on the side of his face. The prescription was for 10 MeV to an area of approximately 7 × 10 cm. The same technician who had treated the first Tyler accident victim prepared this patient for treatment. Much of what follows is from the deposition of the Tyler Therac-25 operator.

As with her former patient, she entered the prescription data and then noticed an error in the mode. Again she used the cursor up key to change the mode from X ray to electron. After she finished editing, she pressed the return key several times to place the cursor on the bottom of the screen. She saw the "beam ready" message displayed and turned the beam on.

Within a few seconds the machine shut down, making a loud noise audible via the (now working) intercom. The display showed Malfunction 54 again. The operator rushed into the treatment room, hearing her patient moaning for help. The patient began to remove the tape that had held his head in position and said something was wrong. She asked him what he felt, and he replied "fire" on the side of his face. She immediately went to the hospital physicist and told him that another patient appeared to have been burned. Asked by the physicist to describe what he had experienced, the patient explained that something had hit him on the side of the face, he saw a flash of light, and he heard a sizzling sound reminiscent of frying eggs. He was very agitated and asked, "What happened to me, what happened to me?"

This patient died from the overdose on May 1, 1986, three weeks after the accident. He had disorientation that progressed to coma, fever to 104 degrees Fahrenheit, and neurological damage. Autopsy showed an acute high-dose radiation injury to the right temporal lobe of the brain and the brain stem.

User and manufacturer response. After this second Tyler accident, the ETCC physicist immediately took the machine out of service and called AECL to alert the company to this second apparent overexposure. The Tyler physicist then began his own careful investigation. He worked with the operator, who remembered exactly what she had done on this occasion. After a great deal of effort, they were eventually able to elicit the Malfunction 54 message. They determined that data-entry speed during editing was the key factor in producing the error condition: If the prescription data was edited at a fast pace (as is natural for someone who has repeated the procedure a large number of times), the overdose occurred.

It took some practice before the physicist could repeat the procedure rapidly enough to elicit the Malfunction 54 message at will. Once he could do this, he set about measuring the actual dose delivered under the error condition. He took a measurement of about 804 rads but realized that the ion chamber had become saturated. After making adjustments to extend his measurement ability, he determined that the dose was somewhere over 4,000 rads.

The next day, an engineer from AECL called and said that he could not reproduce the error. After the ETCC physicist explained that the procedure had to be performed quite rapidly, AECL could finally produce a similar malfunction on its own machine. AECL then set up its own set of measurements to test the dosage delivered. Two days after the accident, AECL said they had measured the dosage (at the center of the field) to be 25,000 rads. An AECL engineer [...]

28 COMPUTER

[...] unable to reproduce the error on his machine, but two months later he found the link.

The Therac-20 at the University of Chicago is used to teach students in a radiation therapy school conducted by the center. The center's physicist, Frank Borger, noticed that whenever a new class of students started using the Therac-20, fuses and breakers on the machine tripped, shutting down the unit. These failures, which had been occurring ever since the center had acquired the machine, might appear three times a week while new students operated the machine and then disappear for months. Borger determined that new students make lots of different types of mistakes and use "creative methods of editing" parameters on the console. Through experimentation, he found that certain editing sequences correlated with blown fuses and determined that the same computer bug (as in the Therac-25 software) was responsible. The physicist notified the FDA, which notified Therac-20 users.

The software error is just a nuisance on the Therac-20 because this machine has independent hardware protective circuits for monitoring the electron-beam scanning. The protective circuits do not allow the beam to turn on, so there is no danger of radiation exposure to a patient. While the Therac-20 relies on mechanical interlocks for monitoring the machine, the Therac-25 relies largely on software.

The software problem. A lesson to be learned from the Therac-25 story is that focusing on particular software bugs is not the way to make a safe system. Virtually all complex software can be made to behave in an unexpected fashion under certain conditions. The basic mistakes here involved poor software-engineering practices and building a machine that relies on the software for safe operation. Furthermore, the particular coding error is not as important as the general unsafe design of the software overall. Examining the part of the code blamed for the Tyler accidents is instructive, however, in showing the overall software design flaws. The following explanation of the problem is from the description AECL provided for the FDA, although we have tried to clarify it somewhat. The description leaves some unanswered questions, but it is the best we can do with the information we have.

As described in the sidebar on Therac-25 software development and design, the treatment monitor task (Treat) controls the various phases of treatment by executing its eight subroutines (see Figure 2). The treatment phase indicator variable (Tphase) is used to determine which subroutine should be executed. Following the execution of a particular subroutine, Treat reschedules itself.

One of Treat's subroutines, called Datent (data entry), communicates with the keyboard handler task (a task that runs concurrently with Treat) via a [...]

Figure 2. Tasks and subroutines in the code blamed for the Tyler accidents.

Figure 3. Datent, Magnet, and Ptime subroutines.

[...] originally designed, the Data-entry completion variable by itself is not sufficient since it does not ensure that the cursor is located on the command line. Under the right circumstances, the data-entry phase can be exited before all edit changes are made on the screen.

The keyboard handler parses the mode and energy level specified by the operator and places an encoded result in another shared variable, the 2-byte mode/energy offset (MEOS) variable. The low-order byte of this variable is used by another task (Hand) to set the collimator/turntable to the proper position for the selected mode/energy. The high-order byte of the MEOS variable is used by Datent to set several operating parameters.

Initially, the data-entry process forces the operator to enter the mode and energy, except when the operator selects the photon mode, in which case the energy defaults to 25 MeV. The operator can later edit the mode and energy separately. If the keyboard handler sets the data-entry completion variable before the operator changes the data in MEOS, Datent will not detect the changes in MEOS since it has already exited and will not be reentered again. The upper collimator, on the other hand, is set to the position dictated by the low-order byte of MEOS by another concurrently running task (Hand) and can therefore be inconsistent with the parameters set in accordance with the information in the high-order byte of MEOS. The software appears to include no checks to detect such an incompatibility.

The first thing that Datent does when it is entered is to check whether the mode/energy has been set in MEOS. If so, it uses the high-order byte to index into a table of preset operating parameters and places them in the digital-to-analog output table. The contents of [...]
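The race between the keyboard handler, Datent, and Hand described in the article can be illustrated with a small deterministic simulation. This is only a sketch under stated assumptions: the mode codes, class and function names, and the `is_consistent` check are hypothetical and are not taken from the Therac-25 code (the article notes that the software appears to include no such check). The interleaving is scripted by hand rather than produced by real concurrent tasks, so that the outcome is reproducible.

```python
from dataclasses import dataclass

# Hypothetical mode codes; the article does not give the real MEOS encoding.
XRAY, ELECTRON = 1, 2


@dataclass
class Shared:
    """Variables shared between the keyboard handler, Datent, and Hand."""
    meos_high: int = 0            # read by Datent to pick operating parameters
    meos_low: int = 0             # read by Hand to position the turntable
    entry_complete: bool = False  # data-entry completion variable


class Datent:
    """Data-entry subroutine: once it has exited, it is not reentered."""

    def __init__(self):
        self.params_mode = None
        self.exited = False

    def run(self, shared):
        if self.exited:
            return  # later edits to MEOS are never seen
        if shared.meos_high:
            self.params_mode = shared.meos_high
        if shared.entry_complete:
            self.exited = True


def keyboard_enter(shared, mode):
    """Keyboard handler: encode the operator's mode into both MEOS bytes."""
    shared.meos_high = mode
    shared.meos_low = mode


def hand(shared):
    """Hand task: always follows the current low-order byte."""
    return shared.meos_low


def simulate(fast_edit):
    """Return (mode used for parameters, mode used for turntable position)."""
    shared, datent = Shared(), Datent()
    keyboard_enter(shared, XRAY)           # operator selects X ray by mistake
    if fast_edit:
        shared.entry_complete = True       # completion flagged first
        datent.run(shared)                 # Datent reads X ray and exits
        keyboard_enter(shared, ELECTRON)   # the correction arrives too late
        datent.run(shared)                 # no effect: already exited
    else:
        keyboard_enter(shared, ELECTRON)   # correction lands before completion
        shared.entry_complete = True
        datent.run(shared)
    return datent.params_mode, hand(shared)


def is_consistent(params_mode, turntable_mode):
    """Hypothetical cross-check between the two consumers of MEOS."""
    return params_mode == turntable_mode


if __name__ == "__main__":
    for fast in (False, True):
        params, table = simulate(fast)
        print(f"fast_edit={fast}: consistent={is_consistent(params, table)}")
```

In the fast-edit run, the operating parameters stay on the X-ray setting while the turntable position follows the electron setting, which is the kind of inconsistency the article describes. A cross-check such as `is_consistent` would flag the mismatch, but the article's broader point is that such software checks should be backed by independent hardware interlocks.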
The contents of 30 On May 2, 1986, the FDA declared the Therac def ective, demanded a CAP, and required renotification of all the Therac customers. In the letter f rom the FDA to AECL. the director of compliance. Center for Devices and Radiological Hcalth, wrote We have reviewed Mr. Downs April1s letter to purchasers and have concluded that it docs not satisfy the requirements [or notification to purchaser s of a defect in an electronic product. Specifically, it does not describe the defect nor the hazards associated with it. The letter does not provide any reason for disablingthe cur sor key and the tone is not commensurate with the urgency for doing so. In fact. the letter implies the inconvenience to operators outweighs the need to disable the key. We request that you immediately renotify purchasers. AECL promptly madc a new notice to users and also requested an extension to produce a CAP. The FDA granted this request. About this time, the Therac- 25 users created a user group and held their first believe that the rigorous testing must be The beam came on but the console the turntable was in the field-light posiperformed each time a modification is displayed no dose or dose rate. After 5 tion-was on the order of 4,000to 5,000 made in order to ensure the modification or 6 seconds. the unit shut down with a rads. After two attempts, the patient does not adverscly affect the safety of pause and displayed a message. The could have received 8,000 to 10,000 in- the system. the system. message "may have disappeared quick- stead of the 86 rads prescribed. AECL AECL was also asked to draw up an ly": the operator was unclear on this again called users on January 26 (nine installation test plan to ensure both hard- point. However, since the machine mere- days af ter the accident) and gave them ware and sof tware changes perform as ly paused, he was able to push the "P" detailed instructions on how to avoid designed when installed. key to proceed with treatment. 
this problem. Inan FDA internal report AECLsubmitted CAP Revision 2 and The machine paused again, this time on the accident, an AECL qualityassursupporting documentation on Decem- displaying "flatness" on the reason line. ance manager investigating the probber 22,1986. They changed the CAP to The operatorheardthcpatientsaysome- lem is quoted as saying that the softhave dose malf unctions suspend treat- thing over the intercom, but couldn't ware and hardware changes to be ment and included a plan for meaning- understand him. He went into the room retrofitted following the Tyler accident ful error messages and highlighted dosc to speak with the paticnt. who reported nine months earlicr (but which had not error messages. They also expanded "feelingaburningsensation"in the chest. yetheeninstalled) would have preventdiagrams of sof tware modifications and The console displayed only the total ed the Yakimaaccident. expanded the test plan to cover hard- dose of the two film exposures ( 7 rads) The patient died in April fromcomware and software. and nothing more. plications related to the ovcrdose. He On January 26, 1987. AECL sent the Later in the day, the patient devel- hadbecn suffering fromaterminalform FDA their "Component and Instal- oped a skin burn over the entire treat- of cancer prior to the radiation overlation lest Plan" and explained that ment area. Four days later, the rcdness dose, but survivors initiated lawsuits their delays were due to the investiga- took on the striped pattern matching allcging that he died sooner than he tion of a new accident on January 17 at the slotsin the blockingtray. The striped would have and endured unnecessary Yakima. pattern was similar to the burn a year pain and suffering due to the overdose. earlier at this hospital that had been The suit was settled out of court. Yakima Valley Memorial Hospital, attributed to "cause unknown." 1987. On Saturday, January 17, 1987, AECL began an investigation, and The Yakima software problem. 
The the second patient of the day was to be users were told to confirm the turntable software problem for the second Yakitreated at the Yakima Valley Memorial position visually before turning on the ma accident is fairly well established Hospital for a carcinoma. This patient beam. All tests run by the AECL engi- and different from that implicated in was to receive two film-verification ex- neers indicated that the machine was the Tyler accidents. There is no way to posures of 4 and 3 rads. plus a 79-rad working perfectly. From the informa- determine what particular software dephoton treatment (for a total exposure tion gathered to that point, it was sus- sign errors were related to the Kenneof 86rads ). pected that the electron beamhad come stone, Hamilton. and first Yakima acciFilm was placed under the patient on when the turntable was in the field- dents. Given the unsafc programming and 4 rads was administered with the lightposition. Buttheinvestigatorscould practices exhibited in the code, it is collimator jaws opened to 2218cm. not reproduce the fault condition that possible that unknown race conditions After the machine paused. the collima- produccd the overdose. or errors could havc been responsible. tor jaws opened to 3535cm automat- On the following Thursday. AECL There is speculation, however, that the ically, and the sccondexposure of 3 rads sent an engineer from Ottawa to inves- Hamilton accident was the same as this was administered. The machine paused tigatc. The hospital physicist had, in the second Yakima ovcrdose. In a report of again. meantime. run some tests with film. He a conference call on January 26, 1987, The opcrator entered the treatment placed a film in the Therac's beam and betwcen the AECL quality assurance room to remove the film and verify the rantwo exposures of X-rayparameters manager and Ed Millcr of the FDA patient's precise position. 
He used the with the turntable in field-light posi- discussing the Yakimaaccident, Miller hand control in the treatment room to tion. The film appeared to match the notcs rotate the turntable to the field-light film that wasleft (bymistake) under the position. a feature that let him check patient during the accident. This situation probablyoccurredin the the machine's alignment with respect to After a week of checking the hard- Hamilton, Ontario, accident a couple of the patient's body to verify proper bcam ware, AECI determined that the "in- years ago. It was not discovered at that position. The operator then either correct machine operation was proba- time and the cause was attrihuted to blynotcausedbyhardwarealone."After subsequent recall of the multiple trol or left the room and typed a set checking the sof tware. AECL discov- microswitch logic network did not really command at the console to return the ered a flaw (described in the next sec- solve the problem. turntable tothe proper positionf or treat- tion) that could explain the erroneous ment: there is somc confusion as to ex- behavior. The codingproblems explain- The second Yakima accident was again actly what transpired. When he left the ing this accident diffcrfrom those asso- attributed to a type of race condition in room. he forgot to remove the film from ciated with the Tyler accidents. the sof tware - this one allowed the underncath the patient. The console AFCL's preliminary dose measure- device tobeactivatedinanerrorsetting displayed "beam ready," and the oper- mentsindicated that the dose delivered (a "failure" of a software interlock). ator hit the "B" key to turn the beam on. 
under these conditions - that is, when The Tyler accidents were related toprob- July 1993 33 The Therac-25 safety analysis included (1) failure mode program changes to correct shortcomings, improve reliand effect analysis, (2) fault-tree analysis, and (3) software ability, or improve the software package in a general examination. sense. The final safety report gives no information about whether any particular methodology or tools were used in Failure mode and effect analysis. An FMEA describes the software inspection or whether someone just read the the associated system response to all failure modes of the code looking for errors. individual system components, considered one by one. When software was involved, AECL made no assessment Conclusions of the safety analysis. The final report of the "how and why" of software faults and took any com- summarizes the conclusions of the safety analysis: bination of software faults as a single event. The latter The conclusions of the analysis call for 10 changes to means that if the software was the initiating event, then no Therac-25 hardware; the most significant of these are credit was given for the software mitigating the effects. interlocks to back up software control of both electron This seems like a reasonable and conservative approach scanning and beam energy selection. to handling software faults. Although it is not considered necessary or advisable to rewrite the entire Therac-25 software package, considerable Fault-tree effort is being expended to update it. The changes recomleading to Class I hazards. To identify multiple failures and mended have several distinct objectives: improve the protecused fault-tree analysis. An FTA tion it provides against hardware failures; provide additional starts with a postulated hazard - for example, two of the able source package. Two or three software releases are top events for the Therac-25 are high dose per pulse and anticipated before these changes are completed. 
illegal gantry motion. The immediate causes for the event design and testing for both hardware and software is well are then generated in an AND/OR tree format, using a ba- under way. All hardware modifications should be completed sic understanding of the machine operation to determine and installed by mid 1989 , with final software updates the causes. The tree generation continues until all branch- extending into late 1989 or early 1990. es end in "basic events." Operationally, a basic event is The recommended hardware changes appear to add sometimes defined as an event that can be quantified (for protection against software errors, to add extra protection example, a resistor fails open). against hardware failures, or to increase safety margins. AECL used a "generic failure rate" of 104 per hour for The software conclusions included the following: software events. The company justified this number as based on the historical performance of the Therac-25 soft- The software code for Beam Shut-Off, Symmetry Control, based on the historical performance of the Therac-25 soft- and Dose Calibration was found to be straight-forward and ware. The final report on the safety analysis said that many no execution path could be found which would cause them fault trees for the Therac-25 have a computer malfunction to perform incorrectly. A few improvements are being incoras a causative event, and the outcome of quantification is porated, but no additional hardware interlocks are required. therefore dependent on the failure rate chosen for soft- Inspection of the Scanning and Energy Selection funcware. 
tions, which are under software control, showed no improper Leaving aside the general question of whether such fail- execution paths; however, software inspection was unable ure rates are meaningful or measurable for software in was due to the complex nature of the code, the extensive general, it seems rather difficult to justify a single figure of use of variables, and the time limitations of the inspection this sort for every type of software error or sof tware behav- process. Due to these factors and the possible clinical ior. It would be equivalent to assigning the same failure consequences of a malfunction, computer-independent interlocks are being retrofitted for these two cases. rate to every type of failure of a car, no matter what particular failure is considered. Given the complex nature of this software design and The authors of the safety study did note that despite the the basic multitasking design, it is difficult to understand uncertainty that software introduces into quantification, how any part of the code could be labeled "straightforfault-tree analysis provides valuable information in showing ward" or how confidence could be achieved that "no exesingle and multiple failure paths and the relative impor- cution paths" exist for particular types of software behavtance of different failure mechanisms. This is certainly true. ior. However, it does appear that a conservative approach Software examination. Because of the difficulty of including computer-independent interlocks - was taken quantifying software behavior, AECL contracted for a de- analyses of software exist in the literature. One such softtailed code inspection to "obtain more information on which ware analysis was performed in 1989 on the shutdown to base decisions." The software functions selected for ex- software of a nuclear power plant, which was written by a amination were those related to the Class I software haz- different division of AECL.' 
Much still needs to be learned ards identified in the FMEA: electron-beam scanning, ener- about how to perform a software-safety analysis. gy selection, beam shutoff, and dose calibration. The outside consultant who performed the inspection in- Reference cluded a detailed examination of each function's implementation, a search for coding errors, and a qualitative as- 1. W.C. Bowman et al., "An Application of Fault Tree Analysis to Safety-Critical Software at Ontario Hydro," Conf. Probabilistic sessment of its reliability. The consultant recommended Safety Assessment and Managerment. 1991. would be an option for other clinics. Sof tware documentation was described as a lower priority task that needed definition and would not be available to the FDA in any form for more than a year. On July 6,1987, AECLsent a letter to all users to inform them of the FDA's verbal approval of the CAP and delineated how AECL would procecd. On July 21,1987, AECL issued the fif th and final CAP rcvision. The major features of the final CAP are as follows: - All interruptions related to the dosimetry system will go to a treatment suspend, not a treatment pause. Operators will not be allowed to restart the machine without reentering all parameters. - A software single-pulse shutdown will be added. - An independent hardware singlepulse shutdown will be added. - Monitoring logic for turntable position will be improved to ensure that the turntable is in one of the three legal positions. - A potentiometer will be added to the turntable. It will provide a visible signal of position that operators will use to monitor exact turntable location. -Interlocking with the 270-degree bending magnet will be added to ensure that the target and beam flattener are in position if the X-ray mode is selected. - Beam on will be prcvented if the turntable is in the field-light or an intermediate position. - Cryptic malfunction messages will be replaced with meaningful messages and highlighted dose-rate messages. 
- Editing keys will bc limited to cursor up, backspace, and return. All other keys will be inoperative. - A motion-enablc foot switch will be added, which the operator must hold closed during movement of certain parts of the machine to prevent unwantcd motions when the operator is not in control (a type of "deadman's switch"). -Twenty-three other changes to the sof tware to improve its operation and reliability, including disabling of unused keys, changing the operation of the set and reset commands, preventing copying of the control programon site,changing the way various dctected hardware faults are handled, eliminating errors in the sof tware that were detected during the review process, adding several additional software interlocks, disallowing 37 dangeroussystem in such away that one incident-analysis procedures that they Software engineering. The Therac-25 failure can lead to an accident violates applywhenever they find any hint of a accidents were lairly unique in having basic system-engineering principles. In problem that might lead to an accident. software codingerrorsinvolved-most this respect, sof tware needs to be treat- Thefirst phone call by Stillshould have computer-related accidents have not ed as a single component. Software led to an extensive investigation of the involved codingerrors butrathererrors should not be assigned sole responsibil- events at Kennestone. Certainly, learn- in the software requirements such as ity for safety, and systems should not be ing about the first lawsuit should have omissionsandmishandledcnvironmendesigned such that a singlc software triggered an immediate response. Al- tal conditions and system states. Alerror or software-engineering error can though hazard logging and tracking is though using good basic software-engibe catastrophic. required in the standards for safety- neering practices will not prevent all A related tendency among cngineers critical military projects, it is less com- software errors. 
it is certainly required is to ignore software. The first safety moninnonmilitaryprojects. Everycom- as a minimum. Some companies introanalysis on the Therac-25 did not in- pany building hazardous equipment ducing software into their systems for clude sof tware (although ncarly full re- should have hazard logging and track- the first time do not take sof tware engisponsibilityfor safety rested on the soft- ing as well as incident reporting and neering as seriously as theyshould. Baware). When problemsstartedoccurring, analysis as parts of its quality control sic software-engineering principlesthat investigators assumed that hardware was procedures. Such follow-up and track- apparentlywereviolatedwith the Therthe cause and focused only on the hard- ing will not only help prevent accidents, ac-25include: ware. Investigation of sof tware's possi- but will easily pay for themselves in ble contribution to an accident should reduced insurance rates and reasonable - Documentation should not be an not be the last avenue explored after all scttlement of lawsuits when they do afterthought. other possible explanations are elimi- occur. - Sof tware quality assurance practicnated. Finally, overreliance on the numcri- es and standards should be estab. Infact.asoftware errorcanalwaysbc cal output of safety analyses is unwise. lished. attributed to a transient hardware fail- The arguments over whether very low - Designs should be kept simple. ure, since software (in these types of probabilities are meaningful with re- Ways to gel information about erprocess-control systems) reads and is- spect to safety are too extensive to sum- rors - for example. software audit sues commands to actuators. Without a marize herc. But. at the least, a healthy trails - should be designcd into the thorough investigation (andwithouton- skepticism is in order. The claim that software lrom the beginning. 
line monitoring or audit trails that save safetyhadbeen increasedfive orders of - The software should be subjected internalstate information), it is not pos- magnitude as aresult of the microswitch to extensive testing and formal sible to determine whether the sensor fix after the Hamilton accident sccms analysis at the module and sof tware provided the wrong information, the hard to justify. Perhaps it was based on level; system testing alone is not software provided an incorrect com- the probability of failure of the mi- adequate. mand, or the actuator had a transient croswitch (typically 105 ) ANDed with failurc and did the wrong thing on its the other interlocks. The problem with Inaddition.specialsafety-analysisand own. In the Hamilton accident, a tran- all such analy

