Problems in Applying Expert System Technology to Radiographic Image Interpretation

David W. Piraino, Bradford J. Richmond, Masataka Uetani, Thomas Luetkehaus, Daniel Rockey, George Belhobek, Joe Armistead, and Fred Jones

A prototype expert system was developed to study the problems of applying expert system technology to radiographic image interpretation. The Radiographic Image Interpretation System (RIIS) was developed on a microcomputer using Turbo Prolog, a low-cost implementation of the Prolog programming language. The present implementation of RIIS was developed to highlight potential problems in applying expert system technology to the evaluation of radiographic images. It was believed that the evaluation of this prototype expert system should include a large number of users unfamiliar with the program's use, as this would probably be the case in clinical use of an image interpretation expert system. At present, the expert system deals with a limited domain of focal bony lesions. Twenty cases of pathologically proven bony lesions of varying difficulty were used to evaluate potential problems in the use of this expert system technology. RIIS, with the 20 sample cases, was presented as an exhibit at the 1987 Radiological Society of North America (RSNA) meeting to evaluate the potential problems with inexperienced users. These results were compared with those of experienced users. When a musculoskeletal radiologist familiar with the program's use provided the "proper description," the program listed the correct diagnosis in its top five 80% of the time. During the program's use at the RSNA meeting, the program selected the correct diagnosis in the top five 22% of the time.

© 1989 by W.B. Saunders Company

KEY WORDS: Diagnosis, computer, expert system, bone tumor, image.

EXPERT system technology, an outgrowth of artificial intelligence research, has been applied in a wide variety of medical situations including radiographic differential diagnosis, hematologic disorders, evaluation of anemia, chemotherapy protocols, and the diagnosis of medical disorders. Applying expert system technology to radiographic image interpretation raises all the standard problems of applying expert systems in any environment. In addition, radiographic differential diagnosis or image interpretation systems must address the problem of an appropriate method for inputting image information into the expert system.

Evaluation of an expert system is an important part of its development. Evaluation of expert systems, however, remains difficult, and a standard evaluation process has not been developed. In general, a system may be evaluated on its accuracy, method of construction, and user impact.

A system's accuracy can be judged against a standard such as pathology. Because an expert system uses the information and knowledge of experts to arrive at a conclusion, its relative accuracy should also be judged against other experts in its area of expertise.2 The techniques used to develop the expert system influence its user interface, methods of deduction, potential conclusions, and explanation of its conclusions. If a knowledge structure or user interface is chosen that is not appropriate for an expert system's environment, the system will not be useful.
The goal of expert system technology is to produce a system that can be used by non-experts to help them arrive at conclusions at a level comparable to that of experts. It is therefore important, when evaluating expert systems, to test the user interface for its ease of use and user acceptance.3 It is also necessary to evaluate whether the expert system helps the non-expert to make decisions at an accuracy level comparable with that of an expert. Finally, the non-expert must be confident that the decision he or she reaches with the help of an expert system is appropriate.

RIIS was developed on a personal computer to explore problems in implementing a prototype expert system for radiographic image interpretation. The study of the problems in implementing this prototype was directed primarily toward expert system construction, user interface, and accuracy. The investigation evaluated users familiar and unfamiliar with the program.

From the Cleveland Clinic Foundation; Tripler Army Medical Center, Honolulu; and Henry Ford Hospital, Detroit.
Address reprint requests to David Piraino, MD, Department of Radiology, The Cleveland Clinic Foundation, 9500 Euclid Ave, Cleveland, OH 44195-5021.
© 1989 by W.B. Saunders Company.
0897-1889/89/0201-0004$03.00/0

RADIOGRAPHIC IMAGE INTERPRETATION SYSTEM

The Radiographic Image Interpretation System (RIIS) was developed to produce a differential diagnosis for focal musculoskeletal lesions. Focal bone lesions were chosen because this domain can be defined relatively easily and because several of the authors are musculoskeletal radiologists. Turbo Prolog, a fourth-generation language for microcomputers, was used to construct the program. The program uses relative likelihoods and relative predictive values to produce a list of differential diagnostic possibilities.

Several psychological models have been developed to explain the process of image interpretation. Differences among these models deal primarily with whether a preliminary expectation of the observer affects the processing of the visual information before the actual inputting of the image information (Table 1).4

Table 1. Example of an Image Interpretation Psychological Model

I. Input: Various light levels projected on retina
II. Preprocess: Retina preprocesses information
III. Segmentation: Grouping of input information
IV. Understanding: Image groups recognized
V. Decision: Decision on image diagnosis*

*Data adapted from Kundel et al.4

The RIIS prototype expert system deals only with the last transformation of this model, from a conceptual understanding of the image to a decision about the most appropriate diagnosis for that image. The decision to start with a language description of the image abnormalities places constraints on the expert system and imposes potential problems concerning image information input.

The user of RIIS selects the positive radiographic findings from a list of possibilities presented by the program. Typical findings included in the initial implementation are bony matrix, chondroid matrix, bony expansion, geographic lesion, and permeative pattern, among many others. The program considers any findings that are not selected as not being present in the image. An information base relating each diagnostic entity to the radiographic findings is contained within the program.
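As a sketch of how such an information base might be organized, consider the following fragment in standard Prolog. The actual system was written in Turbo Prolog, whose typed syntax differs, and every diagnosis, finding, and weight below is invented for illustration rather than taken from the RIIS knowledge base. The two numeric columns are the relative frequency and relative predictive value weights described in the following paragraphs.

    % finding(Diagnosis, Finding, RelFrequency, RelPredictiveValue).
    % Hypothetical entries only; both weights run from 1 to 5.
    finding(osteogenic_sarcoma,      bony_matrix,        4, 4).
    finding(osteogenic_sarcoma,      permeative_pattern, 3, 2).
    finding(chondrosarcoma,          chondroid_matrix,   4, 5).
    finding(chondrosarcoma,          geographic_lesion,  3, 1).
    finding(juxtacortical_chondroma, chondroid_matrix,   4, 4).
    finding(juxtacortical_chondroma, bony_expansion,     3, 2).

In this representation, a user's description is simply a list of the selected findings, eg, [chondroid_matrix, geographic_lesion]; any finding not in the list is treated as absent, matching the convention described above.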
All findings associated with a specific diagnosis, such as osteogenic sarcoma, are given a relative frequency from 1 to 5 and a relative predictive value from 1 to 5. A relative frequency of 1 indicates a very unusual finding for that disease, while a 5 represents a finding that is always associated with that disease. A predictive value of 1 represents a completely nonspecific finding, while a value of 5 represents a pathognomonic finding. Relative frequencies and relative predictive values were assigned by the consensus of two musculoskeletal radiologists after reviewing standard radiology textbooks.

Each diagnosis in the information base is compared with the findings input by the user. In the first level of evaluation, each diagnosis is compared with the positive image findings. Any positive finding that a specific diagnosis can explain raises the relative likelihood of that diagnosis. After all diseases have been evaluated using this technique, a second evaluation takes place: findings that are commonly seen in a specific disease entity but were not described as present on the image decrease the overall likelihood of that diagnosis. Finally, a correction is made for findings described as present in the image that do not occur in the specific diagnosis being considered. In this case, because the diagnosis does not explain all the findings on the image, the overall likelihood of that diagnosis is decreased.

RIIS uses a simple inference engine to accomplish these steps. A simple consecutive search is performed on the information base, comparing the radiographic description of the specific image with the radiographic description of each disease. The relative likelihood of each diagnosis is then calculated.

RIIS arranges the diagnoses in order from most likely to least likely according to their calculated relative likelihoods. The relative likelihoods are compared, and different numeric selection criteria are applied to select the appropriate diagnoses. The selection criteria include (1) considering only those diagnoses whose relative likelihood is greater than 50% of the relative likelihood of the most likely diagnosis, and (2) considering only those diagnoses above the first point at which there is a 50% difference between the relative likelihoods of two adjacent diagnoses.
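Continuing the sketch above, the match-and-score evaluation and the second selection criterion might be rendered as follows in standard Prolog. The paper does not give the exact arithmetic that combines the two weights, so the scoring here (credit the predictive value of each explained finding, subtract the frequency of common-but-absent findings, and subtract a fixed penalty for each unexplained finding) is an assumption chosen only to exhibit the three evaluation steps.

    :- use_module(library(lists)).   % member/2, sum_list/2, reverse/2

    % likelihood(+Dx, +Present, -Score): assumed three-step scoring.
    likelihood(Dx, Present, Score) :-
        % Step 1: credit each positive finding the diagnosis explains.
        findall(PV, (member(F, Present), finding(Dx, F, _, PV)), Credits),
        sum_list(Credits, Credit),
        % Step 2: penalize findings common in the disease (frequency >= 4
        % here, an assumed threshold) but absent from the description.
        findall(Fr, (finding(Dx, F2, Fr, _), Fr >= 4,
                     \+ member(F2, Present)), Misses),
        sum_list(Misses, MissPenalty),
        % Step 3: penalize described findings the diagnosis cannot explain.
        findall(1, (member(F3, Present), \+ finding(Dx, F3, _, _)), Extras),
        sum_list(Extras, ExtraPenalty),
        Score is Credit - MissPenalty - ExtraPenalty.

    % differential(+Present, -Ranked): every diagnosis scored by one
    % consecutive pass over the information base, most likely first.
    differential(Present, Ranked) :-
        setof(D, F^Fr^PV^finding(D, F, Fr, PV), Dxs),
        findall(S-D, (member(D, Dxs), likelihood(D, Present, S)), Pairs),
        msort(Pairs, Ascending),
        reverse(Ascending, Ranked).

    % cutoff_50(+Ranked, -Kept): selection criterion 2, keeping only the
    % diagnoses above the first 50% drop between adjacent likelihoods.
    cutoff_50([S1-D1, S2-_ | _], [S1-D1]) :- S1 > 0, S2 < S1 / 2, !.
    cutoff_50([P1, P2 | Rest], [P1 | Kept]) :- cutoff_50([P2 | Rest], Kept).
    cutoff_50([P], [P]).
    cutoff_50([], []).

With the toy facts above, the query differential([chondroid_matrix, geographic_lesion], R), cutoff_50(R, Kept) ranks chondrosarcoma first and retains juxtacortical chondroma, while osteogenic sarcoma falls below the cutoff.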
PROTOTYPE EVALUATION

RIIS was evaluated with experienced and inexperienced users. Twenty cases of focal bony abnormalities in print form were used to evaluate the program. The cases were selected to represent a range of diagnoses and a range of difficulties, from classic cases to atypical cases. The cases were selected from the Cleveland Clinic teaching file and from clinical practice.

The program, along with the 20 cases in print form, was presented as an exhibit at the 1987 RSNA meeting in Chicago. People viewing the exhibit were instructed in how to use the RIIS system and were encouraged to work through the 20 printed sample cases. A monitoring program was produced to retain statistics throughout the meeting to determine how well the users and the program performed on each case. The users entered their experience level, selected what they considered to be the appropriate differential diagnosis from a list of diagnostic possibilities, and entered the positive findings they identified on the radiograph. The program then produced its differential diagnosis, with associated relative likelihoods, which the users could compare with their own differential diagnoses. Statistics were maintained for the program's use throughout the week as well as for each experience level.

Subsequently, a musculoskeletal radiologist involved in the development of the program and familiar with its syntax and use developed a "proper description" of the abnormalities in each of the 20 cases. While the musculoskeletal radiologist knew the pathologic diagnoses, the proper descriptions were developed independently of the program. All the descriptions were reviewed before input into the program, and any findings that were considered questionable were removed to avoid bias. The proper descriptions were entered into the program and its performance was then evaluated.

The differential diagnosis included only those diagnoses above a cutoff point at the first 50% difference between any two adjacent diagnoses in the differential diagnostic list. This same selection criterion was applied both at the 1987 RSNA meeting and when the proper descriptions were provided.

RESULTS

The results of the program's use at the RSNA meeting, the program's results with a proper description, and the results for RSNA users are presented graphically in Fig 1. At the RSNA meeting, the program was used 268 times by people with varying experience levels. The correct diagnosis appeared in the program's differential diagnostic list in 22% of attempts. The correct diagnosis was listed first in 33 cases (12%), second in 16 cases (6%), third in six cases (2%), fourth in two cases (0.7%), and fifth in two cases (0.7%). The correct diagnosis was not listed in 209 attempts (78%).

Fig 1. This figure graphically demonstrates the percentage, by experience level, in which the pathologic diagnosis was selected in the top five diagnoses. RIIS with a proper description (COMP-CORR) selected the pathologic diagnosis in the top five 80% of the time, and RIIS at the 1987 RSNA meeting (COMP-RSNA) selected the pathologic diagnosis 22% of the time. This compares with musculoskeletal specialists (B/J-SPEC) at RSNA, who selected the correct diagnosis 71% of the time. The remaining bars demonstrate the percentage of attempts in which the diagnosis was listed in the top five for the remaining experience levels, including general radiologists (GEN), fellows (FELLOW), fourth-year residents (RES-4TH), third-year residents (RES-3RD), second-year residents (RES-2ND), first-year residents (RES-1ST), and the average (AVE) for all experience levels.

When the proper description was entered by our musculoskeletal radiologist, the program selected the correct diagnosis in 16 of 20 cases (80%). In 11 of 20 cases (55%), the correct diagnosis was listed as the most likely diagnosis, and in five cases (25%) the correct diagnosis was listed as the second most likely. By comparison, musculoskeletal radiologists at the 1987 RSNA meeting included the proper diagnosis in their differential diagnostic lists in 34 of 48 attempts (71%). General radiologists included the correct diagnosis in their differential diagnostic lists in 58 of 109 attempts (47%).

Statistics for individual cases at RSNA varied widely in the accuracy of both the program and the users.
The best performance for the program at RSNA occurred in case no. 3, a chondrosarcoma, for which the program listed the proper diagnosis as its most likely possibility in over 70% of attempts. By comparison, musculoskeletal radiologists listed chondrosarcoma as the most likely diagnosis 100% of the time, but general radiologists listed chondrosarcoma only 30% of the time. The worst performance for the program at RSNA was case no. 6 (Fig 2), a lytic osteogenic sarcoma. The correct diagnosis was never listed in the program's differential diagnostic list in 15 attempts. In those 15 attempts at RSNA, only one participant listed lytic osteogenic sarcoma in his or her differential diagnosis.

When the proper descriptions were entered, the program did not include the proper diagnosis in four cases. The first case was again the lytic osteogenic sarcoma. The program did not list lytic osteogenic sarcoma as a possibility, although it included osteogenic sarcoma, chondrosarcoma, metastatic disease, and Ewing's sarcoma in its differential diagnostic list. Another case in which the correct diagnosis was not included by the program when a proper description was input was a parosteal osteogenic sarcoma involving a finger. The program listed a single differential diagnostic choice, a juxtacortical chondroma.

DISCUSSION

With inexperienced users, the RIIS program listed the correct diagnosis for the lesions in the 20 sample cases in 22% of attempts. This accuracy level is significantly lower than that of musculoskeletal experts. However, the system's accuracy improves remarkably with a user familiar with the program and its associated syntax and language. In fact, with an experienced user, RIIS performed at a level comparable with musculoskeletal radiologists. The results from the RSNA meeting should also not be interpreted too strictly, because (1) several of the users may have input descriptions slightly different from the sample cases to observe how the program performed in those instances, and (2) 20 cases are a limited sample for evaluating the performance of an expert system.

Fig 2. Representation of case no. 6, a lytic osteogenic sarcoma in the proximal fibula. The correct diagnosis was listed only once out of 16 attempts during the RSNA meeting. RIIS never listed the diagnosis during the RSNA meeting. When the proper description was provided, RIIS's differential diagnosis was osteogenic sarcoma, chondrosarcoma, metastatic disease, and Ewing's sarcoma; the correct diagnosis of lytic osteogenic sarcoma, which is considered separately by the RIIS system, was not included.

RIIS did perform worse on average than the general radiology users at the 1987 RSNA meeting and much worse than the musculoskeletal specialists. There was marked improvement when the proper descriptions were provided by an experienced user. While standard descriptions for focal bony lesions were used by the RIIS program, it appears that these descriptions were interpreted differently by different radiologists. For example, in a case of chondrosarcoma with chondroid matrix, several users described the matrix as ground glass or bone. In these instances, the description of the matrix is considered incorrect, and this incorrect description would adversely affect the program's diagnostic list. The proper description in such a case should include chondroid matrix.
While not all incorrect descriptions are as obvious as the one in this example, such descriptions lead the program further away from the correct diagnosis.

The possibility of providing an incorrect description to the system raises the question of whether a language description of x-ray abnormalities is a legitimate method for inputting abnormal findings on an image or radiograph. Other possibilities for inputting image information include direct input of the image, use of more descriptive terminology to describe the findings, and use of graphic drawings of possible findings. Presently, direct input of the radiographic image is a relatively simple technical task with a state-of-the-art video digitizer. However, the process of abstracting anatomic structures and anatomic abnormalities from the image is an extremely difficult task that is only beginning to be investigated.5 Therefore, direct input of radiographic images would not be practical on a microcomputer.

Possible interim solutions include diagrammatic graphic representations of x-ray abnormalities and a more descriptive input environment. Simplified graphic representations of radiographic abnormalities might help eliminate discrepancies in the use of standard radiographic descriptions. The user could simply select the graphic representations of specific abnormalities, which the system could then assemble into a schematic drawing of the abnormality to provide user feedback. The graphic descriptions could then be assembled by the program for symbolic processing of either the matched descriptive information or the schematic diagrams themselves.

A second, and perhaps simpler, solution is a more simplified descriptive input environment in which the program itself makes interim conclusions about the presence or absence of a finding from simplified descriptions. For example, instead of inquiring whether a bony matrix is present within a lesion, the system asks the user whether the abnormality demonstrates increased, decreased, or mixed density. If the user selects increased density, the program asks more specifically about the location of the area of increased density and for a description of its characteristics. The program then uses this more descriptive terminology to arrive at a conclusion about the presence or absence of a bony matrix.
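As an illustration of this interim-conclusion approach, the fragment below (again in standard Prolog, with an invented question flow and invented prompts; RIIS itself did not necessarily work this way) lets the program, rather than the user, decide that a bony matrix is present.

    % ask(+Prompt): succeeds if the user answers y. Prolog terms end
    % with a period, so answers are typed as "y." or "n.".
    ask(Prompt) :-
        format("~w (y/n)? ", [Prompt]),
        read(y).

    % The program draws its own conclusion about a bony matrix from
    % simpler questions about density. Prompts are hypothetical.
    interim_conclusion(bony_matrix) :-
        ask('Does the abnormality show increased density'),
        ask('Is the increased density within the lesion'),
        ask('Is the dense area fluffy, cloud-like, or ivory-like').

If any answer is n, the query fails and the finding is simply treated as absent, mirroring the program's existing convention that unselected findings are not present in the image.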
Another major problem area involves the inference methodology, or program logic. The inference engine in this system is simplistic in nature. The complete sequential match-and-score technique works relatively well in this small, limited-domain system, as shown by the very good results observed with the proper descriptions. However, in larger, more complex domains this technique becomes computationally intensive and does not lend itself well to explanation of the conclusions made by the system.

A second important problem area in the program's logic is its selection criteria. A cutoff at the first 50% difference in relative likelihood may not be appropriate, especially when the relative likelihoods of all selected lesions are relatively low. This selection criterion is certainly simplistic and may be inadequate for this type of decision. This problem relates to the construction of the expert system prototype and highlights how a specific expert system or programming technique can affect the accuracy of an expert system's performance.

An important evaluation criterion for an expert system is whether its information or knowledge base is factually and conceptually accurate. This task is quite difficult in the medical domain, especially with relative predictive values and frequency measures. Inaccuracies in determining relative frequencies or relative predictive values would be expected to significantly change the accuracy of the program. It is difficult to determine relative frequencies and predictive values from standard radiographic textbooks. The inability to confirm the accuracy of the knowledge base is a significant drawback to this type of knowledge or information structure.

User acceptability and user confidence in the RIIS system were not evaluated. User "believability" in the system is an extremely important evaluation criterion, especially if such expert systems are to be used in a clinical setting. It is therefore important to structure further evaluations of expert systems to evaluate user acceptance and to determine whether the expert system helps non-experts to perform as experts. Furthermore, it is probably more important to provide the user with a good explanation of why certain diseases were selected than simply to present the user with a differential diagnostic list. Several systems at present provide such an explanation as a critiquing facility.7

Finally, for expert systems to be used in a clinical setting, they must be widely available, relatively inexpensive, and user friendly. While many problems remain to be solved before expert systems are clinically accepted, there is much potential in specific limited-domain areas.

REFERENCES

1. Barr A, Feigenbaum EA: The Handbook of Artificial Intelligence (vol 3). Reading, MA, Addison-Wesley, 1982
2. Quaglini S, Stefanelli M, Barosi G, et al: A performance evaluation of the expert system ANEMIA. Comput Biomed Res 21:307-323, 1988
3. Hudson DL, Cohen ME: The role of user-interface in a medical expert system, in Ackerman MJ (ed): Proceedings, Symposium on Computer Applications in Medical Care (SCAMC). Washington, DC, IEEE Computer Society, 1985, pp 232-236
4. Kundel HL, Nodine CF, Doi K: Human interpretation of displayed images, in Hendee WR, Wells PN (eds): Engineering Research in Visual Perception. Chicago, American College of Radiology, 1986
5. Vries JK, Banks G, McLinden S, et al: Three-dimensional neuro-imaging using octree encoding, in Ackerman MJ (ed): Proceedings SCAMC, 1985, p 697
6. Miller RA, Pople HE, Myers JD: Internist-I, an experimental computer-based diagnostic consultant for general internal medicine. N Engl J Med 307:468-476, 1982
7. Swett HA, Miller PL: ICON: A computer-based approach to differential diagnosis in radiology. Radiology 163:555-558, 1987