Expert Systems With Applications, Vol. 2, pp. 47-58, 1991
0957-4174/91 $3.00 + .00
Printed in the USA. © 1991 Pergamon Press plc

Rule-Based Training of Neural Networks

STAN C. KWASNY
Center for Intelligent Computer Systems,* Department of Computer Science, Washington University, St. Louis, MO

KANAAN A. FAISAL
King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia

Abstract-Rule-based expert systems either develop out of the direct involvement of a concerned expert or through the enormous efforts of intermediaries called knowledge engineers. In either case, knowledge engineering tools are inadequate in many ways to support the complex problem of expert system building. This article describes a set of experiments with adaptive neural networks which explore two types of learning, deductive and inductive, in the context of a rule-based, deterministic parser of Natural Language. Rule-based processing of Language is an important and complex domain. Experiences gained in this domain generalize to other rule-based domains. We report on those experiences and draw some general conclusions that are relevant to knowledge engineering activities and maintenance of rule-based systems.

Requests for reprints should be sent to Stan C. Kwasny, Center for Intelligent Computer Systems, Department of Computer Science, Washington University, Campus Box 1045, 1 Brookings Drive, St. Louis, MO 63130.

S.C.K. gratefully acknowledges the support of King Fahd University of Petroleum and Minerals. We thank the reviewers for their comments which have improved aspects of our presentation of this work. The authors also express gratitude to William E. Ball, Georg Dorffner, Marius Fourakis, David Harker, Dan Kimura, Barry Kalman, Ron Loui, John Merrill, Gadi Pinkas, and Robert Port for thoughtful discussions and comments concerning this work.

* The sponsors of the Center are McDonnell Douglas Corporation, Southwestern Bell Telephone Company, and Mitsubishi Electronics America, Inc.

1. INTRODUCTION

IN THE FIELD of Artificial Intelligence (AI), one of the primary goals is to build intelligent systems. Since the 1970s, many such systems have been most easily built symbolically as rule-based systems. In many application areas, the most popular of these are known as rule-based expert systems. These systems have become ubiquitous in their application to everything from medical diagnosis to oil exploration to stock market trading. Typically, these systems develop from consultations with recognized experts who provide knowledge and experience during the process of rule development and debugging.

Most expert systems use rules in some capacity. Usually, the rule formalism becomes the de facto "programming language" for the application and the rules are executed like statements in the language. Building such systems is no trivial task, often occupying the full-time efforts of dozens of people. Much of the effort is spent talking with human experts, gaining their perspective in an application domain, and teasing out the salient information that allows the expert to perform his task with a high degree of skill. Once some knowledge is acquired, it becomes the basis for system organization and rules that mimic the expertise. However, acquisition of such knowledge is an ongoing concern even as refinements are made during maintenance.

The task is not a simple one. Even with the most cooperative experts and best tools, rules are often difficult to formulate.
Ad hoc mechanisms that relate premise to conclusion have been invented to capture some of the vagueness and uncertainty of the relationships articulated by the expert. Certainty factors, for example, were invented for this purpose. Extensive evidence showing the difficulty of knowledge acquisition can be found in a recent special issue of SIGART (Westphal & McGraw, 1989). The task of Natural Language Understanding, in fact, has been perceived as a task of knowledge engineering (Shapiro & Neal, 1982).

Once formulated, rules are difficult to debug and maintain. In XSEL/XCON, for example, it is reported (Barker & O'Connor, 1989) that 40% of the rules require changes each year to adapt to new products and for other reasons. Systems like TEIRESIAS (Davis, 1982) have been specifically designed to aid in the critiquing and reformulation of rule-based systems by the experts themselves. However, the expert is still required to operate at the level of rules, which may be very unnatural in many domains. Rules, no matter how sophisticated, do not always make the best programming language.

Even with better tools, once performance reaches a certain point it may prove difficult to achieve an improved level of performance. Upon analysis, some of the difficulty may be attributable to design decisions made early in the construction of the system which lose their validity as the system evolves. Such problems often require changes to the basic design and complete retooling of the system.

Traditionally, rules also play an important role in the processing of Natural Language. In fact, the comparison between rule-based expert systems and rule-based Natural Language systems is close indeed, especially in examining some of the difficulties encountered. Many linguistic rule systems have been proposed, although the rules in these systems may not always appear explicitly as in rule-based expert systems. Symbolic rules tend to be an important vehicle of specification for the symbolic processing requirements of both types of systems.

The goal for expert systems is to symbolically describe a domain in such a manner that relevant decisionmaking can occur and that a computer system can perform at the level of an expert in this regard. But, is this always a realistic goal? While progress in Natural Language Processing (NLP), for example, continues to be made, why does this continue to be only a partially solved problem? Or, more specifically, why is there no operationally complete set of symbolic rules for English (i.e., no set of rules that perfectly describes all understandable English utterances)? Certainly, it is neither for lack of experts nor lack of effort. Perhaps, if we agree with Winograd and Flores (1987, p. 107), the answer is that ". . . computers cannot understand language." We do not believe computer systems are so impaired.

The problems raised for rule-based expert systems as well as those mentioned for rule-based NLP could be attacked through learning. Learning would seem to be a solution, but traditional learning approaches too often require even deeper insights and understanding of the problem domain than the expert himself may possess. While strictly symbolic learning may be possible, such systems typically need to invent new symbols to extend their capabilities symbolically and this imposes limitations on the approach.
If learning could be conducted in a sufficiently flexible manner, a trainable system could be taught the rules articulated by the expert and then be led to adapt to those cases where the rules fail. In a similar fashion, the competence rules of a grammar can be taught in concert with the performance examples of NLP. This distinction between easily articulated rules and the more difficult ones is analogous to the distinction between textbook and experiential learning. The new intern, for example, is filled with book learning, but lacks the wisdom and knowledge of the domain which only comes through experience. Connectionist (neural) networks hold promise for providing such a flexible learning scheme.

This article presents results from experimentation with a connectionist parsing system. The system is simply viewed as an example of a complex rule-based system. Conclusions reached in these experiments, therefore, have consequences for many other types of rule-based systems.

2. RULE-BASED CONNECTIONIST PARSING

Natural Language processing is both symbolic and subsymbolic. It is symbolic in the role symbols play in writing systems and in the chunking of concepts, while it is subsymbolic because of the fuzziness of concepts and the apparent high degree of parallelism in the activity. In building parsers, the subsymbolic aspect of language is usually lost, although there have been attempts to construct parsers that are totally subsymbolic (see Fanty, 1985; Selman & Hirst, 1985; Waltz & Pollack, 1985).

Rules usually play a sacred role in parsing systems and, as mentioned earlier, are often executed as if following instructions in a program. Whether we are only interested in building robust expert systems, or are interested in modeling human expertise, this method is incorrect. Rules should be permitted to play an advisory role only; that is, for guidance in typical situations and not as prescriptions for precise processing.

In the case of English, if a complete set of rules for all meaningful English forms existed, then it might be satisfactory to rely on a rule-based approach. But no such set of rules exists, nor does it seem desirable or even possible to construct such a set. Any rule-based system that is based literally on rules tends to be brittle since there is no direct way to process inputs that are not anticipated to some extent. Furthermore, the acquisition of new rules often requires tedious retuning of existing rules. The only solution to these problems in a practical and realistic manner is through learning.

We have developed a connectionist deterministic parsing system called CDP which offers solutions to these problems (Kwasny & Faisal, 1989). Training can be conducted either from rules or from examples of processing. The resultant network is then tested on a variety of novel sentence patterns and its generalization capabilities studied.

A set of experiments is presented which supports the claim that Natural Language can be syntactically processed in a robust manner using a connectionist deterministic parser. The model is trained based on a set of deterministic grammar rules and tested with sentences which are grammatical and ones that are not. Tests are also conducted with sentences containing lexically ambiguous items.

2.1. Deterministic Parsing

For a complete understanding of the motivation and architecture of CDP, it is necessary to describe the work on which it is based.
For a readable description of deterministic, "wait-and-see" parsing, see Winston (1984), or for a more thorough discussion, see Marcus (1980). The determinism hypothesis restricts Natural Language Processing to a deterministic mechanism. It states that

    Natural Language can be parsed by a mechanism that operates 'strictly deterministically' in that it does not simulate a nondeterministic machine . . . (Marcus, 1980, p. 11)

It follows from this hypothesis that NLP need not depend on backtracking, nor are any partial structures produced during parsing which fail to become part of the final structure. This is equivalent to prohibiting chains of reasoning in a rule-based expert system which ultimately do not contribute to the final answer. Obviously, processing is restricted in a major way under this assumption.

PARSIFAL (Marcus, 1980) is a demonstration of a deterministic, rule-based parser of Natural Language. Extensions to this system have been proposed for processing ungrammatical sentences [PARAGRAM (Charniak, 1983)], for resolving lexical ambiguities [ROBIE (Milne, 1986)], and for acquiring syntactic rules from examples [LPARSIFAL (Berwick, 1985)]. We have found it beneficial to combine these four tasks into one implementation which is partly symbolic and partly connectionist. The connectionist approach has particular advantages in unifying these four systems (Kwasny, Faisal, & Ball, in press). The system architecture of PARSIFAL has been reconfigured, and the behavior of the rules and other mechanisms from these systems is simulated using a neural network simulator. Learning in the network is achieved through backward propagation, discovered independently by Werbos (1974) and Rumelhart, Hinton, and Williams (1986).

As illustrated in Figure 1, the primary components of a deterministic parser are a buffer for lookahead in the sentence, a stack for processing embedded structures, and a collection of rules for controlling the building and movement of constituents of the sentence being processed. Processing occurs essentially left-to-right. The absence of backtracking is an important advantage in developing a connectionist-based parser since structures, once built, are never discarded.

FIGURE 1. Deterministic Parsing in Schematic Form (stack and buffer).

Rules are partitioned into rule sets. A standard recognize-act cycle is used to achieve a parse. A consequent of a rule may be the activation or deactivation of a rule set, thereby providing a simple conflict resolution strategy. Conflicts within rule sets are resolved from the static ordering (i.e., numeric priority) of the rules. Actions can effect changes to both the stack and the buffer. As buffer positions are vacated at the far end, new sentence components flow in sequentially. If a successful parse is found, a termination rule will fire, leaving the final structure on top of the stack.

2.2. Connectionist Deterministic Parsing

CDP provides a setting for experimentation with a variety of grammars and network designs. It combines the basic mechanism of deterministic parsing with an adaptive, feed-forward neural network to enable the generalization and robustness of connectionism to be evaluated in this domain. A backpropagation neural network simulator, which features a logistic function that computes values in the range of -1 to +1, is being used in this work. The ultimate goal is to construct a mechanism capable of learning to deal syntactically with language in a robust manner.
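The exact form of the simulator's squashing function is not spelled out beyond its range; a minimal sketch of one common choice is the standard logistic rescaled to (-1, +1), which is equivalent to tanh(x/2). The Python names below are ours, not the simulator's, and the derivative is written in the output-based form typically used by backpropagation.

```python
import numpy as np

def bipolar_logistic(x):
    """Logistic squashing function rescaled to the range (-1, +1).

    One common choice consistent with the text; equivalent to tanh(x / 2)."""
    return 2.0 / (1.0 + np.exp(-x)) - 1.0

def bipolar_logistic_deriv(y):
    """Derivative expressed in terms of the unit's output y, as used in backpropagation."""
    return 0.5 * (1.0 + y) * (1.0 - y)
```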
As Gallant (1988) points out, there are important advantages to constructing rule-based systems using neural networks. Our focus is on building a connectionist parser, but with more general issues in mind. How successfully can a connectionist parser be constructed and what are the advantages? Success clearly hinges on the careful selection of training sequences. Our experiments have examined two different approaches and compared them (Faisal & Kwasny, in press).

The "deductive" strategy uses rule "templates" derived from the rules of a deterministic grammar. It is deductive in the sense that it is based on rules that are general (in the sense that they must be applicable in a wide variety of processing situations), but specific sentence forms must be processed. The "inductive" strategy derives its training sequence from coded examples of sentence processing. It is inductive in the sense that it is based on specific sentence examples, although a potentially wider variety of sentence forms must be processed. The goal of both deductive and inductive learning is to produce a network capable of mimicking the rules or sentences on which its training is based and to do so in a way that generalizes to many additional cases. Once initial learning has been accomplished, simulation experiments can be performed to examine the generalization capabilities of the resulting networks.

In our implementation, a moderate-sized grammar developed from PARSIFAL is used for training. The entire set of grammar rules is contained in Appendix A. The grammar used in CDP is capable of processing a variety of sentence forms which end with a final punctuation mark. Simple declarative sentences, yes-no questions, imperative sentences, and passives are permitted by the grammar. The model actually receives as input a canonical representation of each word in the sentence in a form that could be produced by a simple lexicon. Such a lexicon is not part of the model in its present form.

Experiments are conducted to determine the effectiveness of training and to investigate whether the connectionist network generalizes properly to ungrammatical and lexically ambiguous cases. In comparison to the other deterministic parsing systems, CDP performs favorably. Virtually all of our examples have been drawn from previous work. Much of the performance depends on the extent and nature of the training, of course, but our results show that through proper training a connectionist network can indeed exhibit the same behavioral effect as the rules. Furthermore, once trained, the network is efficient, both in terms of representation and execution.

Deductive training performs well on all generalization tasks and outperforms inductive training, scoring higher on all experiments. Reasons for this include the specificity of the inductive training data as well as the lack of a large amount of training data in the inductive case required to provide sufficient variety.

3. LEARNING A RULE-BASED GRAMMAR

A deterministic parser applies rules to a stack and buffer of constituents to generate and perform actions on those structures. One of its primary features, as mentioned earlier, is that it does not backtrack, but proceeds forward in its processing, never building structures which are later discarded.

Training of CDP proceeds by presenting patterns to the network and teaching it to respond with an appropriate action using backpropagation. The input patterns represent encodings of the buffer positions and the top of the stack from the deterministic parser. The output of the network contains a series of units representing actions to be performed during processing and judged in a winner-take-all fashion. Network convergence is observed once the network can achieve a perfect score on the training patterns themselves and the error measure has decreased to an acceptable level (set as a parameter). All weights in the network are initialized to random values between -0.3 and +0.3. Once the network is trained, the weights are saved so that various experiments can be performed.

A sentence is parsed by iteratively presenting the network with coded inputs and performing the action specified by the network. Each sentence receives a score representing the average strength of responses during processing. The closer the sentence matches the training patterns, the lower the error and the greater the strength. Strengths are used for comparison purposes only. Strengths are computed as the reciprocal of the error. The strength for each individual step is simply the reciprocal of the error for that step. The step error is computed as the Euclidean distance between the actual output and an idealized output consisting of a -1 value for every output unit except the winning unit, which has a +1 value. The errors for each step are summed and averaged over the number of steps. The average strength is the reciprocal of the average error per step.
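This scoring can be made concrete with a short sketch. The Python fragment below (function names are ours, not part of CDP) selects the winning action and computes step and sentence strengths exactly as just defined: the step error is the Euclidean distance to an idealized vector that is -1 everywhere except +1 at the winning unit, and the sentence strength is the reciprocal of the average step error.

```python
import numpy as np

def select_action(outputs):
    """Winner-take-all interpretation: the most highly activated output unit
    designates the action to perform."""
    return int(np.argmax(outputs))

def step_error(outputs):
    """Euclidean distance between the actual output vector and an idealized
    vector of -1 values with +1 at the winning unit."""
    ideal = -np.ones_like(outputs)
    ideal[np.argmax(outputs)] = 1.0
    return float(np.linalg.norm(outputs - ideal))

def sentence_strength(step_outputs):
    """Average response strength for a sentence: the reciprocal of the mean
    per-step error.  Assumes at least one step and a nonzero error."""
    errors = [step_error(np.asarray(out, dtype=float)) for out in step_outputs]
    return len(errors) / sum(errors)
```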
As mentioned earlier, there are two distinct approaches to training a network to parse sentences. Each of these training strategies results in a slightly different version of CDP. The differences in the derivation of the two types of training patterns are illustrated in Figure 2. Deductive training begins with deterministic grammar rules which are coded into rule templates, one rule template representing one grammar rule. Instantiation of a rule template leads to a training pattern which is presented during learning. Coding and instantiation are discussed below. Inductive training is based on traces of sentence processing itself. The coded training patterns derived in this way have in some sense already been instantiated and, therefore, are suitable for learning with no further translation.

FIGURE 2. Extraction of Deductive and Inductive Training Patterns. (Deductive training patterns are derived from deterministic grammar rules via coded rule templates; inductive training patterns are coded directly from sentence traces.)

3.1. Deductive Learning

Each grammar rule is coded as a training template which is a list of feature values, but templates are not grouped into rule packets as in PARSIFAL. Each constituent in the rule is represented by an ordered feature vector of +1 (on), -1 (off), or ? (do not care). Instantiation of the vector occurs by randomly changing ? to +1 or -1. Each template, therefore, can be instantiated into many patterns, making each epoch of training slightly different. During training, the network learns the inputs which are highly correlated with expected outputs and those that are not. Training is arranged so that ? values are uncorrelated with the outputs from those training patterns. Each rule template containing n ?'s can generate up to 2^n unique training cases. Some rule templates have up to 30 ?'s, which means they represent approximately 10^9 training cases. It is obviously impossible to test the performance of all these cases, so a zero is substituted for each ? in the rule template to provide testing patterns. While in actual processing a zero activation level for a feature will never be encountered, zero is a good test since it represents the mean of the range of values seen during training.
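As an illustration of this instantiation scheme, the sketch below represents a template as a list of +1, -1, and ? entries, draws one random training pattern from it, and builds the zero-substituted pattern used for convergence testing. The representation and names are ours; the actual constituent codings come from the grammar in Appendix A.

```python
import random

def training_pattern(template, rng=random.Random(0)):
    """Instantiate a rule template: each '?' (don't-care) entry is replaced at
    random by +1 or -1, so a template with n ?'s can yield up to 2**n distinct
    training cases; the definite entries are copied unchanged."""
    return [rng.choice((+1.0, -1.0)) if v == '?' else float(v) for v in template]

def testing_pattern(template):
    """Convergence-testing pattern: every '?' becomes 0, the mean of the
    values presented during training."""
    return [0.0 if v == '?' else float(v) for v in template]

# A toy template with three don't-care positions (up to 2**3 instantiations).
example = [+1, -1, '?', +1, '?', '?']
print(training_pattern(example))   # e.g., [1.0, -1.0, 1.0, 1.0, -1.0, 1.0]
print(testing_pattern(example))    # [1.0, -1.0, 0.0, 1.0, 0.0, 0.0]
```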
Grammar rules are coded into rule templates by concatenating the feature vectors of the component constituents from the stack and buffer. Each grammar rule takes the following form:

    {<1st item> <2nd item> <3rd item> → Action}

For example, a rule for Yes/No questions would be written:

    {(S node) ("have") (NP) (VERB, -en) → Switch 1st and 2nd items}

while a rule for imperative sentences would be written:

    {(S node) ("have") (NP) (VERB, inf) → Insert YOU}

By replacing each constituent with its coding, a rule template is created. In the two rules above, rule templates are created with a ? value for many of the specific verb features of the initial form "have" in each rule, but are carefully coded for the differences in the third buffer position where the primary differences lie. Because different actions are required, these are also coded to have different teaching values during training.

Appendix A contains the grammar rules used as a basis for all deductive training experiments in this study. Our rules were derived from the grammar contained in Appendix C of Marcus (1980), which includes those rules specifically discussed in building a case for deterministic parsing. They can be taken as representative of the mechanisms involved. To assure good performance by the network, training has ranged from 50,000 to 200,000 presentations cycling through training cases generated from the rule templates. Once training is complete, the parser that uses the network correctly parses those sentences that the original rules could parse.

3.2. Inductive Learning

Inductive training depends on training patterns derived from traces of processing of actual sentences. This processing is guided by application of the rules of a deterministic grammar as before. This form of training requires that the network demonstrate the correct rule-following behavior after training runs with a comparatively small sample of sentence traces. Although no symbolic rules are learned, the behavior of the rules is captured within the weights of the network. Furthermore, the behavior is guaranteed to approximate the behavior required in the sample training sentences as closely as desired, depending on the convergence rate and the quantity of training employed.

The primary difference between the two forms of learning is seen if we consider the space of possible patterns. With deductive training, that space is systematically presented during learning in such a way that each major distinction to be made during processing is represented in each epoch. Inductive training happens in a less systematic way with no guarantee of appropriate representation of cases. Thus, deductive training imposes an ordering on the training patterns that assures a completeness which is difficult to achieve with inductive training, but inductive training patterns reflect the frequency of rule occurrences seen in processing the actual samples of sentences.

A small set of positive sentence examples was traced, which resulted in 64 unique training patterns. These were used for all inductive experiments in this study.
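The derivation of inductive training patterns can be sketched as follows. The helper names (encode, grammar_step, apply_action) are placeholders for the symbolic machinery described in the next section, not part of the published system; the point is only that the deterministic grammar acts as the teacher and that each processing step contributes one (coded state, action) pair. Deduplicating the pairs collected over a small set of sentences yields a pattern set like the 64 unique patterns mentioned above.

```python
def collect_trace(sentence, encode, grammar_step, apply_action, stop_action="STOP"):
    """Derive inductive training patterns from one rule-guided parse (a sketch).

    sentence     -- list of canonically coded words, as a lexicon might produce
    encode       -- codes the stack top and buffer into a network input vector
    grammar_step -- the deterministic grammar acting as teacher: given the
                    stack and buffer, returns the action the rules would take
    apply_action -- performs an action symbolically on the stack and buffer
    Returns a list of (coded_input, action) training pairs, one per step.
    """
    stack, buffer, stream = [], [], list(sentence)
    pairs = []
    while True:
        while len(buffer) < 3 and stream:     # refill vacated buffer positions
            buffer.append(stream.pop(0))
        state = encode(stack, buffer)
        action = grammar_step(stack, buffer)
        pairs.append((state, action))
        if action == stop_action:             # termination rule fired
            return pairs
        apply_action(action, stack, buffer)
```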
3.3. Architecture of CDP

As Figure 3 illustrates, CDP is organized into a symbolic component and a subsymbolic component. Actions in CDP are performed symbolically on traditional data structures which are also maintained symbolically. The subsymbolic component is implemented as a numeric simulation of an adaptive neural network. The symbolic and numeric components cooperate in a tightly coupled manner since there are proven advantages to this type of organization (Kitzmiller & Kowalik, 1987). For CDP, the advantages are performance and robustness.

FIGURE 3. CDP System Organization. (Symbolic and subsymbolic components, with the input stream entering and the parse structure produced.)

It is the responsibility of the symbolic component to handle the input sentence, coding it for presentation to the network. In the subsymbolic component, the network produces a designation of an action to be performed by producing an activation pattern across output units. Activation of output units is interpreted in a winner-take-all manner, with the highest activated unit determining the action to be taken. Actions themselves are performed symbolically on conventional data structures. The whole process is very efficient in time and space, although learning itself occurs off-line and is a time-consuming process.

In the set of experiments described here, the network has a three-layer architecture as illustrated in Figure 4, with 35 input units, 20 hidden units, and 20 output units. Each input pattern consists of three feature vectors from the buffer items and one stack vector. Each vector activates 14 input units in a pattern vector representing a word or constituent of the sentence. The stack vector activates seven units representing the current node on the stack. In our simplified version of the grammar, only two items are coded from the buffer and thus 35 input units are sufficient. One hidden layer has proven sufficient in all of these experiments. The output layer represents the 20 possible actions that can be performed on each iteration of processing.

FIGURE 4. Subsymbolic Component. (Input, hidden, and output layers coding the buffer and stack items and the resulting actions.)

What the model actually sees as input during sentence processing is not the raw sentence but a canonical representation of each word in the sentence in a form that could be produced by a simple lexicon, although such a lexicon is not part of the model in its present form. Iteration over an input stream is performed by moving unprocessed sentence forms into the buffer as vacancies are created. Iteration ends when the buffer becomes empty and a stop action is requested by the network.

At this point, it is instructive to follow an example with more details revealed, as shown in Figure 5. When a sentence form like "John have should scheduled the meeting." appears in the input stream, the first three constituents fill the buffer. Note that in reality this is an ungrammatical sentence form. Later, it will be shown that CDP can correctly produce the structure shown. The contents of the three elements along with the contents of the top of the stack are coded into a feature vector and presented to the network, producing a single action. The action is executed, potentially producing changes to the buffer and stack. When processing stops, the final structure can be seen on the top of the stack, just as for PARSIFAL.

FIGURE 5. CDP System Overview.
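To make the architecture and the processing loop concrete, here is a minimal sketch under the same caveats as before: make_network builds a 35-20-20 feed-forward net with initial weights in [-0.3, +0.3] and a symmetric logistic squashing function, and parse is the run-time counterpart of the trace-collection sketch above, with the trained network rather than the grammar rules choosing each action. encode and apply_action again stand in for the symbolic component, and the stop-action index is illustrative.

```python
import numpy as np

def make_network(n_in=35, n_hidden=20, n_out=20, seed=0):
    """Three-layer network with the dimensions given in the text (two 14-unit
    buffer vectors plus a 7-unit stack vector on input; 20 hidden units; one
    output unit per action).  Weights start in [-0.3, +0.3]; bias terms are
    omitted for brevity."""
    rng = np.random.default_rng(seed)
    squash = lambda z: 2.0 / (1.0 + np.exp(-z)) - 1.0   # logistic in (-1, +1)
    W1 = rng.uniform(-0.3, 0.3, (n_hidden, n_in))
    W2 = rng.uniform(-0.3, 0.3, (n_out, n_hidden))
    return lambda x: squash(W2 @ squash(W1 @ np.asarray(x, dtype=float)))

def parse(sentence, network, encode, apply_action, stop_action=0, max_steps=100):
    """Iteratively code the stack and buffer, let the network pick an action
    (winner-take-all), and execute it symbolically until a stop action is
    requested with the buffer empty.  The final structure remains on top of
    the stack, as in PARSIFAL."""
    stack, buffer, stream = [], [], list(sentence)
    for _ in range(max_steps):
        while len(buffer) < 3 and stream:            # refill vacated buffer slots
            buffer.append(stream.pop(0))
        action = int(np.argmax(network(encode(stack, buffer))))
        if action == stop_action and not buffer:     # stop requested, buffer empty
            break
        apply_action(action, stack, buffer)
    return stack[-1] if stack else None
```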
4. PERFORMANCE

CDP is capable of successfully processing a variety of simple sentence forms such as simple declarative, passive, and imperative sentences as well as yes-no questions. For test and comparison between the inductive and deductive CDPs, several sentences are coded that parse correctly by the rules of the deterministic parser. Also, several mildly ungrammatical and lexically ambiguous sentences are coded to determine if the network generalizes in any useful way. The objective is to test if syntactic context could aid in resolving such situations.

4.1. Parsing Grammatical Sentences

Experimentation with grammatical sentences demonstrates the ability of CDP to perform as PARSIFAL. Earlier we mentioned that convergence testing from the rule templates is possible by changing each ? to a zero value. Here we examine the performance of CDP with actual sentences.

Grammatical sentences, by our definition, are those which parse correctly in the rule-based grammar from which we derived the training set. Table 1 shows several examples of grammatical sentences which are parsed successfully along with their response strengths in both deductive and inductive learning.

TABLE 1
Grammatical Sentences Used in Testing

                                                   Deductive       Inductive
Sentence Form                                      Avg. Strength   Avg. Strength
(1) John should have scheduled the meeting.            283.3            84.7
(2) John has scheduled the meeting for Monday.         179.3            84.2
(3) Has John scheduled the meeting?                     132.2            64.4
(4) John is scheduling the meeting.                     294.4            83.5
(5) The boy did hit Jack.                               298.2            76.2
(6) Schedule the meeting.                               236.2            67.8
(7) Mary is kissed.                                     276.1            84.9
(8) Tom hit(v) Mary.                                    485.0            80.3
(9) Tom will(aux) hit(v) Mary.                          547.5            78.7
(10) They can(v) fish(np).                              485.0            80.3
(11) They can(aux) fish(v).                             598.2            76.8

Strengths are computed as the reciprocal of the average error per processing step for each sentence. The error on each step is determined by taking the Euclidean distance between the actual vector of output unit activation values and an idealized vector with only the winning unit turned on (one corner of the hypercube in error space). Strengths reflect the certainty with which actions for building structures are being selected, but should be used for relative comparisons only.

Each example shows a relatively high average strength value, indicating that the training data has been learned well. Also, the deductive average strength value is higher, in almost all cases, than the corresponding inductive average strength. Although comparisons are difficult to make due to variations in the number of unique training patterns and other factors, the deductively trained network exhibits uniformly more definitive decisions than the inductively trained network.

Most of these sentences come from examples used in the deterministic parsing systems described earlier. Parse trees are developed which are identical with ones produced by those systems. Sentences (8)-(11), which contain ambiguous words, are presented to CDP unambiguously, but the lexical choices are provided in parentheses.

Capabilities described thus far have only duplicated what is easily done symbolically. Of course, the network does support very fast decisionmaking due to the feed-forward nature of the model. But what other features does the model possess? Importantly, how robust is the processing?

4.2. Lexical Ambiguity

ROBIE extends PARSIFAL to address issues of lexical ambiguity. It requires additional rules and lexical features to handle these properly. In the deterministic approach, it is essential that lexical items be properly disambiguated to permit processing to proceed without backtracking. In a set of experiments with CDP, the parser is tested for this.

Normal sentences are presented, except that selected words were coded ambiguously (here indicated by angle brackets < > around the word) to represent an ambiguously stored word from the lexicon. Selected sentences are shown in Table 2. The numbers again indicate the average strength for each sentence. In the cases shown, the lexically ambiguous words are correctly interpreted and reasonable structures result.

CDP utilizes syntactic context to resolve these ambiguities. Again, the generalization capability of the network automatically works to relate novel situations to its training cases. For lexically ambiguous situations, some inputs may contain features which confuse its identity as expected by the parser. The context provided by the buffer and stack of the deterministic parser has proven to be sufficient to aid in resolving many ambiguities. An important fact is that, as before, no additional rules or mechanisms are required to provide this capability.

For example, sentence (12) presents can ambiguously as an auxiliary modal and main verb, while fish is presented uniquely as an NP. Can is processed as the main verb of the sentence and results in the same structure as sentence (10) of Table 1. This is shown in Figure 7. In sentence (10), each word is presented unambiguously, with can coded as a verb and fish coded as an NP. The same structure results in each case, with the average strength level much higher in the unambiguous case.

Likewise, sentence (13), by coding fish ambiguously as a verb/NP and coding can uniquely as an auxiliary verb, produced the same structure as sentence (11). This is the structure in Figure 8.

Sentence (14) contains the word will coded ambiguously as an NP and an auxiliary modal verb. In the context of the sentence, it is clearly used as a modal auxiliary and the parser treats it that way. A similar result is obtained for sentence (15). In sentence (16), hit is coded to be ambiguous between an NP (as in playing cards) and a verb. The network correctly identifies it as the main verb of the sentence.

TABLE 2
Lexically Ambiguous Sentences Used in Testing

                                                   Deductive       Inductive
Sentence Form                                      Avg. Strength   Avg. Strength
(12) They