Title: Towards an expert system for enantioseparations: induction of rules using machine learning
Authors: Bryant, CH, Adam, AE, Taylor, DR and Rowe, RC
Type: Article
DOI: http://dx.doi.org/10.1016/0169-7439(96)00016-0
URL: This version is available at: http://usir.salford.ac.uk/1772/
Published Date: 1996

Towards an Expert System for Enantioseparations: Induction of Rules Using Machine Learning

C.H.Bryant*, A.E.Adam (Computation Department), D.R.Taylor (Chemistry Department)
University of Manchester Institute of Science and Technology, PO Box 88, Manchester, M60 1QD, United Kingdom.

R.C.Rowe
Zeneca Pharmaceuticals, Alderley Park, Macclesfield, Cheshire, SK � �NA, United Kingdom.

Abstract

A commercially available machine induction tool was used in an attempt to automate the acquisition of the knowledge needed for an expert system for enantioseparations by High Performance Liquid Chromatography using Pirkle-type chiral stationary phases (CSPs). Various rule-sets were induced that recommended particular CSP chiral selectors based on the structural features of an enantiomer pair. The results suggest that the accuracy of the optimal rule-set is ��% ± �%, which is more than ten times greater than the accuracy that would have resulted from a random choice.

*Correspondence to: C.H.Bryant, School of Computing and Mathematics, The University of
Huddersfield, HD1 3DH, United Kingdom.

1 Introduction

This paper presents the first results of a project concerned with the development of an expert system for enantioseparations, that is, the separation of enantiomers. It describes an attempt to automate the first step in the process of developing such a system using a technique of artificial intelligence known as machine induction. Although machine induction has been applied to analytical chemistry before (see Section 2), the authors believe that this is the first published work to describe a validated application of machine induction to enantioseparations.

The separation of enantiomers by High Performance Liquid Chromatography (HPLC) using chiral stationary phases (CSPs) is based on the formation of transient diastereomeric complexes between the enantiomers of the solute and a chiral selector that is an integral part of the stationary phase. The difference in stability between these complexes leads to a difference in retention time; the enantiomer that forms the less stable complex will be eluted first. If the difference in stability is too small, no separation is observed. Such enantioseparations are important in many scientific disciplines, including stereoselective synthesis, mechanistic and catalytic studies, agrochemistry, medicine and pharmacology. (See [�] for a review of enantioseparations.)

Since enantioseparations are performed in many disciplines, and since there is a choice of over �� commercially available CSPs, guidelines are needed on the choice of materials for enantioseparations by HPLC. A computer system which could guide analysts in the choice of materials for enantioseparations by HPLC would be beneficial because there are currently few guidelines on how to choose the materials and they are difficult to access: the papers describing them are spread across a wider range of scientific journals than analysts can be reasonably expected to survey. CHIRBASE [�], [�], [�] is a conventional database which makes data on
enantioseparations accessible, but it is expensive. Furthermore, it does not tell an analyst how to use such data, that is, guide an analyst in the selection of materials for a particular enantioseparation.

CHIRULE is a computer system that was designed to provide such guidance. CHIRULE was developed by Stauffer and is described in his PhD thesis [�]. It uses similarity searching on molecular properties to retrieve a list of enantiomer pairs that are chemically similar to a given enantiomer pair, together with columns that have been reported in the literature to have successfully separated them. However, in his thesis Stauffer does not report testing CHIRULE to see which CSPs it would recommend when it was given enantiomer pairs which have been reported in the literature as having been separated on Pirkle-type CSPs. Pirkle-type CSPs are so named because their invention is credited to W.H.Pirkle's group at the University of Illinois. They are also referred to as the "brush" or "multiple interaction" type. They are chiral selectors of moderate molecular weight covalently bonded to silica. As far as the authors of this work are aware, this is the first published work to have taken a validated first step towards a computer system that gives guidance on the selection of materials for enantioseparations on Pirkle-type CSPs.

The remainder of this paper describes the first results of a project concerned with the development of an expert system for enantioseparations by HPLC. An expert system is a computer program that represents and reasons with knowledge of some specialist subject with a view to solving problems or giving advice [�]. The characteristics of expert systems are described in [�], together with previous expert systems for chromatography.

2 Machine Induction

This section introduces a technique of artificial intelligence called machine induction, a branch of machine learning, and explains why it has been used as a first step towards developing an expert system for
enantioseparations. The original nature of the work described in this paper is illustrated by briefly reviewing previous applications of machine induction to analytical chemistry.

The process of acquiring the knowledge needed for an expert system is called knowledge acquisition. The knowledge acquisition process is usually divided into three stages: deciding what knowledge is needed, variously referred to as the definition stage or initial analysis; getting knowledge, predominantly from human experts, and interpreting it, usually called elicitation; and "writing" the knowledge in the internal language of the system, encoding it, usually called representation. Knowledge acquisition as described above is a notoriously slow process and has become known as the "bottle-neck" in the process of developing expert systems [�]. The knowledge acquisition problem for this project initially appeared particularly severe because no human experts in the selection of materials for enantioseparations were available to work on the project. This paper describes an attempt to overcome this problem by automating the knowledge acquisition process using machine induction.

The motivation for using machine learning was the expectation that a machine learning technique might enable a computer to learn how to recommend one or more suitable CSP chiral selectors for a given enantiomer pair. The subject matter of machine learning is the study and computer modelling of learning processes. There are two fundamental reasons for studying learning: to understand the process itself and to provide computers with the ability to learn [�]. One of the results of research aimed at providing computers with the ability to learn has been a number of widely known machine induction algorithms such as ID3 [��], [��]. Some of these algorithms have been incorporated into commercially available tools such as Ex-Tran, 1st-Class and the one used in this project, DataMariner (see Section 3). Machine induction
algorithms, such as that used by DataMariner, take as input a set of examples, known as the training set, and produce as output a set of classification rules. These rules are of the form

IF description THEN class

These rules can then be used to predict the class of previously unseen examples. Each example in the training set represents an example from the domain as a set of attribute values. The same attributes must be used for all the examples. One attribute is the classifier and its values are the classes to which particular examples belong. The other attributes are known as the predicting attributes. The description in the rule antecedent usually comprises conditions on the predicting attributes.

In this work the classes were CSP chiral selectors and the predicting attributes were chemical structural features. The aim was to develop a set of classification rules that would recommend one or more CSP chiral selectors given particular details of structural features of a given enantiomer pair.

[Footnote: The authors realise that an expert system for enantioseparations by HPLC would need to provide the users of such a system with more information than just which CSP chiral selector to use. However, in this work, the first step in the development of such a system, the recommendations were limited to CSP chiral selectors so that the experiments with machine induction would remain tractable.]

The original nature of this work is illustrated in the remainder of this section by briefly reviewing previous applications of machine induction to analytical chemistry.

Only a few references to the application of domain independent machine induction algorithms to induce rules for analytical chemistry domains were found in the literature. Two papers describe systems for classifying organic pollutants given their GC-MS data. Both describe the use of commercially available tools that incorporate induction algorithms based on ID3 [��], [��]. Derde et al. [��] used Ex-Tran to induce classification
rules. Scott [��] successfully used 1st-Class to induce classification and identification decision trees.

Recently, Mulholland et al. [��] used C4.5, an extension of ID3, to induce a decision tree for choosing a detector when performing ion interaction chromatography. The decision tree was validated in two ways. Firstly, a similar tree was generated using only ��% of the data for training and this tree was tested using the other ��% of the data. Secondly, the tree was tested using another test-set which was provided by a domain expert and comprised �� pertinent examples of the ideal choice of detector, as selected by that expert. The validation showed that ��% of the recommendations made by the decision tree were an exact match with the published methods and a further ��% were acceptable to the domain expert in that s/he thought that they would perform well for the given separation.

The data used by Mulholland et al. originated from a database of published methods for ion chromatography. The database contained information on almost ���� applications, including most of the chromatographic conditions employed. Part of this data was input to the C4.5 algorithm after being preprocessed. Mulholland et al. reported that this preprocessing was the most time consuming part of the work. It is widely known within the field of machine induction that preprocessing of data is often necessary. Later sections of this paper describe how the data used in this work was preprocessed.

The most famous example of a machine induction system in analytical chemistry is Meta-Dendral. The work on Meta-Dendral was different to the other work in analytical chemistry in that it did not utilise any domain independent induction algorithms; a machine induction system was developed as part of the project. The role of Meta-Dendral was to help a chemist determine the relationship between molecular fragmentations and the structural features of the compounds. Meta-Dendral produced rules which could be used by Dendral, an
expert system which uses a set of rules to reason about the domain of mass-spectrometry. The quality of the rules generated by Meta-Dendral was assessed by testing them on structures not in the training set, by consulting mass spectroscopists and by comparing them with published rules. The program succeeded in rediscovering known rules of mass-spectrometry that had already been published, as well as discovering new rules. Its ability to predict spectra for compounds outside the original sets of instances was impressive [�].

3 Experimental

This section describes the tool used for the experiments, the data input to the tool and the experiments themselves.

The tool that was used in this project is called DataMariner (Release �����) [��], [��]. It incorporates a rule induction algorithm which can be used to generate rules for membership of classes. The classes must be disjunctive, that is, membership of classes is mutually exclusive and non-hierarchical. DataMariner induces rules with the following syntax.

classname rule no
IF clause 1 clause 2 . . .
THEN conclusion 1 (probability 1) conclusion 2 (probability 2) . . .

The rule consequent is an implicit disjunction of clauses, where each clause is a conclusion about class membership and has a probability associated with it. The rule antecedent is an implicit conjunction of clauses, that is, a set of clauses that are implicitly logically ANDed together. Each one of these clauses can only involve one attribute. Thus rules in which there is a disjunction involving two or more attributes are not allowed. A clause of the rule antecedent can specify the value(s) of a discrete attribute as one of the following:

- discrete value (e.g. detector = uv)
- disjunction of discrete values (e.g. detector = uv OR fluorescence)
- negation of a discrete value (e.g. detector != uv)

[Footnote: Numeric attributes are allowed but they are outside the scope of this paper.]

DataMariner comprises a number of tools. A description of some of these is given below.
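The rule syntax described earlier in this section can be sketched as a small data structure. The following is a minimal illustration in Python, not DataMariner's implementation: the class and method names (`Clause`, `Rule`, `holds`, `fires`), the attribute `column`, the value `c18`, the class names and the probabilities are all hypothetical and chosen only for the example. Only the `detector` values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clause:
    """One antecedent clause; it may mention only a single attribute."""
    attribute: str
    values: frozenset      # one value, or several values for a disjunction
    negated: bool = False  # True models the negation of a discrete value

    def holds(self, example: dict) -> bool:
        matched = example.get(self.attribute) in self.values
        return not matched if self.negated else matched


@dataclass(frozen=True)
class Rule:
    clauses: tuple      # implicitly ANDed together
    conclusions: tuple  # (class name, probability) pairs, implicitly ORed

    def fires(self, example: dict) -> bool:
        return all(clause.holds(example) for clause in self.clauses)


# Hypothetical rule:
#   IF detector = uv OR fluorescence, column != c18
#   THEN class_a (0.8), class_b (0.2)
rule = Rule(
    clauses=(
        Clause("detector", frozenset({"uv", "fluorescence"})),
        Clause("column", frozenset({"c18"}), negated=True),
    ),
    conclusions=(("class_a", 0.8), ("class_b", 0.2)),
)
```

Because each `Clause` names exactly one attribute, a disjunction across two different attributes cannot be expressed in this structure, which mirrors the restriction stated above.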
Merge: This can be used to merge values of attributes.

Divide: Divide can be used to split the data into several training and test files so that a K-fold cross-validation can be performed.

Induce: This produces a set of rules describing each class in turn, where the classes are sorted by the number of examples belonging to each class, in descending order. The induction process continues for each class until all the examples that belong to that class are covered by the induced rules. The order of the induced rules describing each class is important. Once the first rule has been induced for a class, all the examples which are covered by that rule are ignored when inducing the next rule. Thus an example obeys a second induced rule only if it does not obey the first rule and does obey the second rule.

Induce uses an algorithm developed from the PRISM algorithm [��]. [Footnote: Details of the specific algorithm used by Induce are not given because they could not be released by Logica.] The PRISM algorithm is described below.

For each class in turn:
1. For each attribute-value pair, calculate the probability that an example which has that value for that attribute belongs to the class.
2. Select the attribute-value pair which has the largest probability and create a subset of the training set comprising all the examples which contain this attribute-value pair.
3. Repeat steps 1 and 2 for this subset until it contains only examples of the class. The induced rule is a conjunction of all the attribute-value pairs used in creating the homogeneous subset.
4. Remove all the examples covered by this rule from the training set.
5. Repeat steps 1 to 4 until all the examples of the class have been removed.

The PRISM algorithm is based on the ID3 algorithm but, instead of producing a decision tree, it produces production rules directly. The major difference between ID3 and PRISM is that ID3 is concerned with finding the attribute which is most relevant whilst PRISM is concerned with finding the attribute-value pair which is most relevant. The problem with finding the attribute which is most relevant is that this attribute may have some values which are irrelevant. Thus PRISM avoids a drawback of ID3.

Prune: This can be used to prune rules. It examines each clause in each rule, starting with the last clause in a rule, to test whether a clause significantly improves the proportion of examples correctly allocated to the class. If a clause fails this test then it is removed and the preceding clause is tested. If it does not fail then the preceding rule is tested. If all the clauses of a rule are found to make an insignificant contribution then the whole rule is removed.

Prune uses the Fisher one-tailed statistic to decide whether a clause significantly improves the proportion of examples correctly allocated to the class; no domain knowledge is used to support its actions. The level of pruning can be controlled using a parameter known as the prune-level. The level can be regarded as a filter where a high figure implies that more should be retained. Pruning with the prune-level set to �% would remove all of the rules. Pruning with the prune-level set to ���% would not remove any clauses or rules, although this would remove redundant conditions.

Evaluate: The rules induced by DataMariner can be tested using Evaluate. Evaluate uses the induced rule-set to classify some examples and compares the results with the actual classifications, that is, those classifications which are known before the rules are induced. Evaluate generates a variety of other information that guides the data analyst in identifying any problems or omissions in the rules. This information may include, for example, suggestions on how the values of attributes could be merged.

The way in which DataMariner interprets the data given to it can be controlled in a number of ways. Some examples of these are described below. DataMariner can be instructed to:

- ignore one or more attributes and their values;
- treat one or
more discrete attributes as ordinal types and prevent the generation of disjunctive clauses containing non-contiguous values of these attributes. DataMariner treats a discrete variable as nominal unless it is given this instruction;
- use a specified attribute as the classifier;
- only generate rules for a number of specified classes.

The data that were input to DataMariner were limited to a subdomain of enantioseparations as follows.

- Only analytical separations, not preparative ones, were considered.
- Only enantioseparations by HPLC were considered.
- Only the use of CSPs was considered, as opposed to the addition of a chiral additive to the mobile phase.
- Only successful separations on commercially-available Pirkle-type CSPs were considered.

[Footnote: A separation was judged to be a success if one of the following mutually exclusive conditions was true. The percentage of the separations represented by the data input to DataMariner that satisfied each of these conditions is shown in parentheses after each one.
1. The separation factor, α, had been recorded and was greater than or equal to ��� (��%).
2. The separation factor had not been recorded but resolution, Rs, had and was greater than or equal to ���� (�%).
3. Neither the separation factor nor resolution had been recorded but the literature either stated that a separation was a success or illustrated this using a chromatogram (��%).]

The data were extracted from chemistry journals and literature obtained from suppliers of CSPs. The data were stored as the values of attributes. One of the attributes was es�name, which represented the name of a chiral selector of a CSP. All of the remaining attributes represented instances of chemical features of an enantiomer pair. The chemical features selected, and the names that were used for them, are shown in Figure �. There are some features which distinguish between substructures where one or more aromatic groups are attached to a functional group and substructures where none are attached to the same type of functional group. The former are referred to as aromatic and the first letter of the corresponding attribute name is B. The latter are referred to as aliphatic and the first letter of the corresponding attribute name is R.

There were three attributes for each chemical feature. [Footnote: Except the number of chiral centres.] Each attribute contained a single character which was a digit representing the distance of an occurrence from the nearest chiral centre in terms of the number of connecting bonds. The three attributes for each feature were numbered 1, 2 and 3 to indicate that they represented the first, second and third closest occurrences respectively. This did not allow for molecules where a feature occurred more than three times; a compromise had to be drawn between having a practical number of attributes and allowing for a larger number of instances.

Rules were devised to ensure that structural features were represented uniformly. These rules, which are described below, were obeyed for all the data that were input to DataMariner.

The distance from the chiral centre was the number of connecting bonds between the nearest chiral centre and the atom of the structural feature which was closest to that chiral centre. If there were two or more chiral centres equidistant then one was arbitrarily chosen, as the choice was of no consequence. For structural features which were functional groups, it was the atom of the functional group itself, and not an atom in a connected ring or chain, which was closest. For structural features which were a double bond between carbon atoms in an alkyl chain, it was whichever one of the two atoms connected by the bond was closest. With the exception of alkyl chains, if a structural feature occurred at the chiral centre the distance was considered to be zero.

An alkyl chain which started with a carbon atom at the chiral centre was represented as that chain of carbon atoms less the one at the chiral centre, the distance
from the chiral centre being entered as one. Alkyl chains which passed through the chiral centre were conceptually split at the centre and represented as two alkyl chains, each one being treated as though it had started there. The alkyl chain attributes represented all alkyl chains regardless of the degree of saturation; they did not represent this. Branched chains were conceptually split into the longest straight chain and the side chains originating from it. If any of the side chains were branched then they too were split in the same manner. Thus branched side chains were split recursively until there were none remaining. Each conceptually-formed chain was represented separately. Thus branched chains were represented as a series of substituent straight chains. The way in which these substituent chains were inter-connected was not represented.

The following rules were devised for functional groups. If an occurrence of a functional group was part of a ring, as distinct from attached to a ring, then it was not represented as a functional group in the database. If an occurrence of a functional group was part of an occurrence of a larger functional group then the occurrence of the smaller group was not represented in the database. If two occurrences of the same functional group, or two occurrences of two different functional groups, shared some but not all of the same atoms then both occurrences were represented.

Only amides which were derivatives of carboxylic acids in which the OH portion of the COOH group had been replaced by NH2 (as such or substituted) were represented as amides. Thus amides could take the following forms:

RCONH2            primary
RCONHR'           primary
RCONR'R''         primary
RCONHCOR'         secondary
RCON(COR')COR''   tertiary

An amide was considered to be aromatic if R, R' or R'' was an aromatic group. Whenever a NH2 (as such or substituted) occurred which was not part of an amide as defined above, it was represented as an amine. An amine was considered to be aromatic if one
or more aromatic groups were attached to the nitrogen. Otherwise an amine was considered aliphatic.

Once the data had been stored in accordance with these rules, experiments were performed. DataMariner was instructed to use the attribute es�name as the classifier for all the experiments that were performed using Induce and Merge.

Table � summarises the experiments performed using the tools Induce and Merge. The experiments are identified by numbers which correspond to the chronological order in which the experiments were performed. The first experiment that was performed is referred to as test 1, the second as test 2 and so on. Tables � and � list the experiments in such a way that similar ones are grouped together rather than in chronological order. The difference between the orders reflects the exploratory manner in which DataMariner was used.

[Footnote: When the experiments were designed, the fact that the attributes representing the alkyl chains would never have a value of 0 was overlooked. Consequently values such as � or at the centre or � that appear in some of the clauses generated by DataMariner that involve the alkyl chain attributes are misleading. However, this oversight is of no consequence with respect to the validations performed, since neither the data used to test nor the data used to train will have a value of 0 for any of the alkyl chain attributes.]

The purpose of tests �, �, � and �� was to investigate the effect of increasing the number of classes for which DataMariner was instructed to induce rules. Tests � and �� investigated how the induced rules would differ if DataMariner was instructed to ignore the attributes for the second and third occurrences of chemical features. Tests �, ��, �� and �� investigated the effects of merging the values of the chemical feature attributes. Tests �, �� and �� explored whether the values of these attributes should be ordered. Tests �� and �� investigated what the effect would be of ordering the values created by merging the original values of the
chemical feature attributes.

Prune was used on some of the rule-sets induced during the experiments described above. Prune was used in two ways:

1. To remove redundant conditions from rule-sets. This was done by setting the prune-level to ���%. Table � indicates for which rule-sets Prune was used in this way by adding the extension �p��� to the name of the experiments concerned.
2. To investigate the effects of pruning the rule-sets.

Most of both the pruned and unpruned rule-sets were tested using Evaluate. All the examples from the example-file had to be used for training to ensure that the accuracy of the induced rules would be acceptable: there were ��� examples belonging to �� classes, giving an example to class ratio of just �� �. Since none of the examples could be used exclusively for testing, Evaluate could only be used to calculate the classification success-rates of the rule-sets on their training sets and to cross-validate the rule-sets. The type of cross-validation performed was a K-fold cross-validation where K was equal to ten. Table � shows some of the statistics that were calculated when the file used for testing was identical to that which had been used for training, and Table � shows the statistics that were estimated using cross-validation.

In addition to being cross-validated, the rule-set induced during test �� was manually validated. That is, a paper exercise was used rather than Evaluate. This exercise will be referred to as the external validation because the rule-set was tested on �� enantioseparations that were not stored in the example-file used by DataMariner. These enantioseparations were reported in sources similar to those from which the data in the example-file originated. The choice of enantiomer pairs was restricted to those which had been separated on one of the CSP chiral selectors for which rules had been generated by DataMariner. The external validation compared, for some enantioseparations not stored in the example-file, the CSP
chiral selectors recommended by the rule-set induced during test �� with the choice of selector reported in the literature. [Footnote: Only the recommendations of the first of the rules in the rule-set that could fire were considered.] The aim of the external validation was to prove that the cross-validation correctly simulated the effects of testing with unseen data.

4 Results and Discussion

In tests �-� DataMariner induced rules whose clauses specified not only whether a particular occurrence of a chemical feature was present and, if so, how far it was from the chiral centre, but also whether the occurrence was the first, second or third closest occurrence of that chemical feature. This author believes that in some cases it may not matter whether a chemical feature is the closest, second closest or third closest occurrence of that feature, as long as the feature is present at a particular distance or within a range of distance values. However, DataMariner could not have induced rules that represented this because it could not induce rules in which there was a disjunction of attributes. For example, DataMariner could not have induced a clause such as

cooh1 OR cooh2 OR cooh3 = �

In tests �-�� DataMariner was instructed to ignore all the attributes that represented the second and third occurrences so that rules would be induced that reasoned about the presence of the nearest occurrences only.

The effects of ignoring the second and third occurrences can be analysed by comparing tests � and �, as these were identical in every other respect. When the second and third occurrences were ignored, the number of rules increased very slightly whilst the classification success-rate on the training set remained at ���%. This suggested that providing DataMariner with data on the second and third occurrences did not result in better rules. Consequently DataMariner was instructed to ignore the second and third occurrences in all the remaining experiments.

The effects of using different ordinal types are
shown by tests �, � and ��. These tests were identical except for the data types used for the chemical feature attributes. The attributes had discrete values in all three tests but in tests � and �� the values were ordered. The use of the ordinal types reduced the number of rules from �� in test � to � in both tests � and ��. The average number of clauses per rule rose from three in test � to �� in tests � and ��. In test � the order was not present, �, �, �, �, �, �, �. The order in test �� was the same except that not present came after �. This makes more chemical sense because not present can be considered to be the case where a chemical feature is an infinite number of bonds away.

Tests � and �� suggested that some of the values of the chemical feature attributes should be merged. Consider the first rule induced during test �, which is shown in Figure �. All the clauses that are disjunctions include the values � and �. This was reflected across the rest of the rule-set: �� out of the �� disjunctions in the rule-set included these values, which suggests that they should be merged. Test �� suggested that the values � and � should be merged and that the values �, �, � and � should be merged; the first rule from test �� is shown in Figure � and illustrates this. There were �� disjunctions in the rule-set for test ��. Eight of these suggested that � and � should be merged and �� that �, �, � and � should be merged.

The effects of merging the values of the chemical feature attributes can be analysed by comparing tests ��, ��, �� and ��. These tests were identical except for the way in which values were merged. The rules that were induced in test ��, in which no values were merged, are very specific. They include many precise statements about the distance of chemical features from the chiral centre. Consider the first rule that was induced, which is shown in Figure �. The clauses involving bx� state that bx� should not equal �, � or �. The clause involving boh� states that boh� should equal not
present or �. The rules generated during test �� seem chemically implausible because they are very precise about the distances.

[Footnote: This rule, and the one shown in Figure �, has a redundant clause: cen � � serves no purpose because it appears after another clause, cen � � OR �. Such redundancy could have been removed (see Section 3) but this would have been irrelevant to the purpose of the experiment.]

In test �� the values �, �, �, �, �, �, � of all the chemical feature attributes were merged to the value present. In test �� the values � and � were merged to � or � bonds away, and �, �, � and � were merged to more than five bonds away. In test �� the merges performed in test �� were repeated and, in addition, the values � and � were merged to at the centre or �. The accuracies calculated by cross-validation for all four tests were indistinguishable but the number of rules for the classes did vary. Merging all the values to the value present increased the number of rules from �� to ��, that is, by ���. Merging some of the values led to a slight increase, �� in test �� and �� in test ��.

The effects of merging the values can also be seen by comparing tests �� and ��. Test �� was similar to test ��; identical merges were performed in both but in test �� the values were ordered after they were merged. The effects caused by the merge used in test �� can be considered in isolation by comparing the results of tests �� and ��; given the merges performed, the ordinal types used in these tests are effectively the same. Figures � and � show two comparable rules from tests �� and �� respectively. These figures show that merging values results in more general rules. Consider the respective clauses for the attribute rconh�. In test �� the clause is as follows.

rconh� � � OR OR not�present

In test �� this is generalised to the following.

rconh� � more�than�five�bonds�away OR not�present

Table � lists the results of the cross-validation. It shows that the accuracies were all more than ten times
greater than the accuracy that would result from choosing one of the selectors at random.

The tests that were cross-validated differed only in the merges that were performed and the ordinal types that were used. Table � shows that for any two of the tests that were cross-validated, $\bar{p}_A - \sigma_{p_A} \le \bar{p}_B + \sigma_{p_B}$, where A is the test with the largest $\bar{p}$ value and B is the other test. Hence the estimates of accuracy for these tests are indistinguishable: the values for $\bar{p}$ are too close given the values of $\sigma_p$. This suggests that using merges or ordinal values did not affect the accuracy of the resulting rules.

[Footnote: This is explained later in this section.]

Table � shows some of the results of the external validation performed on the rule-set induced during test ��. It indicates the extent of the agreement on the choice of CSP chiral selector between the literature and the rule-set induced during test ��. Tables � to �� list the names and structures of the enantiomer pairs used in the external validation and show the diverse range of structures used. Only for two of the �� enantiomer pairs (�%) did the rule-set fail to recommend the choice of CSP chiral selector reported in the literature. The two enantiomer pairs concerned are Labetolol and N���FMOC� ��benzoylglycine (N-phenylamide). In both these cases the rule-set failed to recommend any CSP chiral selector. The choice of CSP chiral selector reported in the literature was either the first or second choice recommendation of the rule-set for �� of the �� enantiomer pairs (��%). The choice of selector reported in the literature was the first choice of the rule-set for �� of the �� enantiomer pairs (��%).

The accuracy calculated using just the first choice of the rule-set is most comparable to the cross-validation result for test ��, since Evaluate calculates accuracy by assigning each example to the class with the highest probability associated with it amongst all the rules that can fire. The cross-validation result for test �� was ��% ± �%
and the accuracy calculated during the external validation using just the first choice was ��%. Hence the cross-validation and external validation are mutually corroborative: the difference between the upper limit of the cross-validation result and the external validation result is only �%.

The analysis of the experiments with Prune was difficult. The developers of DataMariner acknowledge that a possible consequence of pruning is that exception relationships that are correct but rare can be eliminated. They recommend that pruned and unpruned rules should always be checked to confirm that no valuable information has been lost [��]. It is not easy to provide a chemical justification for the rules that were induced as part of this work by looking at the rules themselves. Consequently it is impossible to check that Prune did not result in the loss of valuable information. Prune can be used to remove clauses or rules that are induced as a result of noise in an example-file [��]. However, the rule-sets could not have been improved significantly by Prune: the example-file was carefully and meticulously prepared. Prune has as great a potential to have an adverse effect as it does to have a beneficial one because it relies solely upon a statistical test to support its actions: it cannot distinguish between a clause whose presence is due to noise and one whose presence is due to an exceptional relationship which is correct but rare.

Recall that this paper is concerned with the knowledge acquisition phase of developing an expert system for enantioseparations rather than the implementation of such a system. Therefore a detailed discussion of the phases that must follow the knowledge acquisition phase is consigned to further work; the remainder of this section briefly indicates how the optimal rule-set induced during test �� could be used. The authors believe that the conflict resolution strategy [�] that follows should be adopted given the induction algorithm used by Induce (see
Section ��)

1. Try to fire each rule in turn until a rule fires.

2. Let the first choice recommendation of the rule-set be the CSP chiral selector in the consequent of the rule that fired which has the highest probability associated with it.

3. If the consequent of the rule that fired is a disjunction of CSP chiral selectors then let the second choice recommendation be the selector in the consequent that has the second highest probability associated with it. Let the third choice be the one with the third highest probability and so on.

Such a strategy could be used to generate an ordered list of recommended CSP chiral selectors whenever the consequent of the rule that fires is a disjunction. This would suit analysts as they would then be free either to try each selector in the list in turn, starting with the first choice of the rule-set, or to choose selectors from the list using other criteria such as cost or availability in their laboratory.

� Conclusions

The optimal rule-set must:

• have rules for membership of all the classes, that is CSP chiral selectors;

• be induced using an ordinal type which reflects the inherent order in the distance values and allows not present to be considered as the case where a chemical feature is an infinite number of bonds away. Rule-sets induced when such an ordinal type is used are smaller and have rules where the average number of clauses is much larger than the corresponding rule-sets which are induced when ordinal types are not used but the experimental conditions are otherwise identical;

• be induced using the following merges that were suggested by Evaluate:
    � and � merged to at the centre or �;
    � and � merged to � or �;
    � � � and � merged to more than five.
  Unless merges are performed the induced rules include clauses that are too precise about the distances. The merges suggested by Evaluate make chemical sense and result in more general and plausible rules.

The rule-set induced during test �� fulfills these requirements and so is the
optimal rule-set.

DataMariner was successfully used to induce and validate rules that recommended particular CSP chiral selectors based on the structural features of an enantiomer pair. Although it is not easy to provide a chemical justification for the rules by looking at them, the results suggest that they have a high degree of accuracy. The cross-validation performed on the optimal rule-set induced suggests that this rule-set would recommend as its first choice a correct CSP chiral selector for ��% ± �% of enantiomer pairs that can be separated on Pirkle-type CSPs. The external validation, which used test data that had not been input to DataMariner, supported the results of the cross-validation. The accuracy of the optimal rule-set is more than ten times greater than the accuracy that would result from choosing one of the selectors at random. The external validation suggests that either the first or second choice recommendation of the optimal rule-set would be correct for ��% of enantiomer pairs that can be separated on Pirkle-type CSPs.

� Acknowledgements

The funding was provided by EPSRC under the remit of the Total Technology programme and by Zeneca Pharmaceuticals. R. Dallaway and I.T. Nabney of Logica Cambridge Ltd. provided helpful advice on the use of DataMariner. C. Bryant is grateful for this and the hospitality shown to him by all at Logica Cambridge Ltd. G.V. Conroy from the Computation Department at UMIST made some valuable comments on the machine induction aspects of this work. D.J. Williams from the Chemistry Department at UMIST prepared the diagrams of the chemical structures.

� References

[1] D.R. Taylor and K. Maher, Chiral Separations by High-Performance Liquid Chromatography, Journal of Chromatographic Science �� (����) �����.

[2] C. Roussel and P. Piras, CHIRBASE: A Molecular Database for Storage and Retrieval of Chromatographic Chiral Separations, Pure & Applied Chemistry �� (����) �������.

[3] B. Koppenhoefer, A. Nothdurft, J.
Pierrot-Sanders, P. Piras, C. Popescu, C. Roussel, M. Stiebler and U. Trettin, CHIRBASE, a Graphical Molecular Database on the Separation of Enantiomers by Liquid, Supercritical Fluid, and Gas Chromatography, Chirality � (����) �������.

[4] B. Koppenhoefer, R. Graf, H. Holzschuh, A. Nothdurft, U. Trettin, P. Piras and C. Roussel, CHIRBASE, a Molecular Database for the Separation of Enantiomers by Chromatography, Journal of Chromatography ��� (����) �������.

[5] S.T. Stauffer, Expert System Shells in Chemistry: CHIRULE, a Chiral Chromatographic Column Selection System using Similarity Searching and Personal Construct Theory, PhD Thesis, Virginia Polytech Ins. State Univ., USA (����).

[6] P. Jackson, Introduction to Expert Systems, 2nd Ed., Addison-Wesley ����.

[7] C.H. Bryant, A.E. Adam, D.R. Taylor and R.C. Rowe, A Review of Expert Systems for Chromatography, Analytica Chimica Acta ��� (����) �������.

[8] D. Diaper, Knowledge Elicitation: Principles, Techniques and Applications, Ellis Horwood ����.

[9] S. Kocabas, A Review of Learning, Knowledge Engineering Review � (����) �������.

[10] J.R. Quinlan, Discovering Rules from Large Collections of Examples: a Case Study, in D. Michie (Ed.), Expert Systems in the Micro Electronic Age, Edinburgh University Press, Edinburgh ����.

[11] J.R. Quinlan, Learning Efficient Classification Procedures and their Application to Chess End Games, in R.S. Michalski, J.G. Carbonell, and T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Palo Alto, Tioga ����.

[12] M. Derde, L. Buydens, C. Guns and D.L. Massart, Comparison of Rule-Building Expert Systems with Pattern Recognition for the Classification of Analytical Data, Analytical Chemistry �� (����) ���������.

[13] D.R. Scott, Classification and Identification of Mass Spectra of Toxic Compounds with an Inductive Rule-Building Expert System and Information Theory, Analytica Chimica Acta ��� (����) �������.

[14] M. Mulholland, D.B. Hibbert, P.R. Haddad, C. Sammut, Application of the C��� Classifier
to Building an Expert System for Ion Chromatography, Chemometrics and Intelligent Laboratory Systems �� (����) ������.

[15] Logica UK Limited, DataMariner, User Manual, Version B �����

[16] I.T. Nabney and O. Grasl, Rule Induction for Data Exploration, in Proceedings of Avignon ��, Expert systems and their applications � (����) �������.

[17] J. Cendrowska, PRISM: An Algorithm for Inducing Modular Rules, International Journal of Man-Machine Studies �� (����) �������.

[18] W.H. Pirkle, T.C. Pochapsky, G.S. Mahler and R.E. Field, Chromatographic Separation of the Enantiomers of �-Carboalkoxyindolines and N-Aryl-�-amino Esters on Chiral Stationary Phases Derived from N-(3,5-Dinitrobenzoyl)-�-amino Acids, Journal of Chromatography ��� (����) ��� ��.

[19] I.W. Wainer and M.C. Alembik, Steric and Electronic Effects in the Resolution of Enantiomeric Amides on a Commercially Available Pirkle-Type High-Performance Liquid Chromatographic Chiral Stationary Phase, Journal of Chromatography ��� (����) �����.

[20] L.E. Weaner and D.C. Hoerr, Separation of Fatty Acid Ester and Amide Enantiomers by High-Performance Liquid Chromatography on Chiral Stationary Phases, Journal of Chromatography ��� (����) �������.

[21] R. Dernoncour and R. Azerad, High Performance Liquid Chromatographic Separation of the Enantiomers of Substituted �-Aryloxypropionic Acid Methyl Esters, Journal of Chromatography ��� (����) �������.

[22] A. Berthod, H.L. Jin, A.M. Stalcup and D.W. Armstrong, Interactions of Chiral Molecules With an (R)-N-(3,5-Dinitrobenzoyl) Phenylglycine HPLC Stationary Phase, Chirality � (����) �����.

[23] W.H. Pirkle and J.E. McCune, Separation of the Enantiomers of N-Protected �-amino Acids as Anilide and � �-dimethylanilide Derivatives, Journal of Chromatography ��� (����) �������.

[24] Phenomenex Ltd. U.K., The Arsenal, Heapy Street, Macclesfield, Cheshire SK�� �JB.

Table 1 Summary of experiments that used the Induce and Merge tools of DataMariner.

Table 2 Results of experiments that
used the Induce, Merge and Prune Tools of DataMariner. Statistics calculated by Evaluate for each rule-set as a whole. (The file used for testing was identical to that which had been used for training.)

Table 3 Results of cross-validation. Statistics on the accuracy with which the rule-sets induced from the training files classify examples from the test files.

Table 4 Results of external validation. Number of occurrences of different rankings.

Table 5 Some of the data that were used in the external validation. Some of the enantiomer pairs for which (R)-N-(3,5-dinitrobenzoyl)phenylglycine was both the first choice recommendation of the optimal rule-set and the chiral selector used in the separations reported in the literature.

Table 6 Some of the data that were used in the external validation. Some of the enantiomer pairs for which (R)-N-(3,5-dinitrobenzoyl)phenylglycine was both the first choice recommendation of the optimal rule-set and the chiral selector used in the separations reported in the literature.

Table 7 Some of the data that were used in the external validation. Some of the enantiomer pairs for which (R)-N-(3,5-dinitrobenzoyl)phenylglycine was both the first choice recommendation of the optimal rule-set and the chiral selector used in the separations reported in the literature.

Table 8 Some of the data that were used in the external validation. The enantiomer pairs for which (S)-N-(3,5-dinitrobenzoyl)leucine was both the second choice recommendation of the optimal rule-set and the chiral selector used in the separations reported in the literature.

Table 9 Some of the data that were used in the external validation. The enantiomer pairs for which the chiral selector used in the separations reported in the literature was neither the first nor the second choice recommendation of the optimal rule-set.

Table 10 Some of the data that were used in the external validation. The enantiomer pairs for which the optimal rule-set did not make any recommendations.

Figure 1 The
chemical features of enantiomer pairs that were input to DataMariner and the names that were used for them.

Figure 2 One of the rules induced during test �.

Figure 3 One of the rules induced during test ��.

Figure 4 One of the rules induced during test ��.

Figure 5 One of the rules induced during test ��.

Figure 6 One of the rules induced during test ��.

Test  No. of chiral      2nd & 3rd occurrences     Values merged                      Ordinal
      selectors for      of chemical features of   All(a)  Some(b)  Some(c)  None     values
      which rules were   enantiomer pairs ignored
      induced
�     �                  no                        no      no       no       yes      none
�     �                  no                        no      no       no       yes      none
�     �                  no                        no      no       no       yes      none
�     �                  yes                       no      no       no       yes      none
��    ��                 yes                       no      no       no       yes      none
�     �                  yes                       no      no       no       yes      yes(d)
��    �                  yes                       no      no       no       yes      yes(e)
��    ��                 yes                       no      no       no       yes      yes(e)
�     �                  yes                       yes     no       no       no       none
��    ��                 yes                       yes     no       no       no       none
��    �                  yes                       yes     no       no       no       none
��    ��                 yes                       no      yes      no       no       none
��    ��                 yes                       no      yes      no       no       yes(f)
��    ��                 yes                       no      no       yes      no       none
��    ��                 yes                       no      no       yes      no       yes(g)

(a) The values �, �, ..., � of the chemical feature attributes were merged to the value present.
(b) Two merges were performed on the chemical feature attributes: � and � were merged to � or � bonds away; �, �, � and � were merged to more than five bonds away.
(c) Three merges were performed on the chemical feature attributes: � and � were merged to at the centre or �; � and � were merged to � or �; �, �, � and � were merged to more than five.
(d) The following order was specified for the values of each of the chemical feature attributes: not present, �, �, �, ..., �.
(e) The following order was specified for the values of each of the chemical feature attributes: �, �, �, ..., �, not present.
(f) The following order was specified for the values of each of the chemical feature attributes: �, �, �, �, � or � bonds away, more than five bonds away, not present.
(g) The following order was specified for the values of each of the chemical feature attributes: at the centre or �, �, �, � or � bonds away, more than five bonds away, not present.

Table 1: Summary of experiments that
used the Induce and Merge tools of DataMariner

Test   No. of rules   Average no. of clauses per rule   Overall accuracy %   No. of misclassified firings   No. of unclassified eg.s
� �� � ��� � ��� � �� � ��� � ��� ���p��� �� � �� � �� � � �� �� �� �� �� � �� �� �� �� ���p��� �� � �� �� �� ��p��� �� � ��� � ��� ���p��� �� � �� � �� ���p��� �� � �� � �� ���p��� �� � �� �� �� ���p��� �� � �� � �� ���p��� �� � �� �� ��

Table 2: Results of experiments that used the Induce, Merge and Prune Tools of DataMariner. Statistics calculated by Evaluate for each rule-set as a whole. (The file used for testing was identical to that which had been used for training.)

Test   Accuracy %
       Mean   Standard Error
��     ��     �
��     ��     �
��     ��     �
��     ��     �
��     ��     �
��     ��     �
��     ��     �

Table 3: Results of cross-validation. Statistics on the accuracy with which the rule-sets induced from the training files classify examples from the test files.

Rank(a)   Enantiomer Pairs with this Ranking: Number   (Number x ���)
� �� � �� �� � � �� � � � � � � � � � � � � � � � �
No rules fired � �

(a) Rank that was assigned by the rule-set induced during test �� for the choice of CSP chiral selector reported in the literature.

Table 4: Results of external validation. Number of occurrences of different rankings.

Enantiomer Pair: Name   Ref(a)   Structure
ethyl N-phenyl phenylglycine   [��]   [structure diagram]
���ethoxycarbonyl�indoline   [��]   [structure diagram]
N-phenyl���methylheptanamide   [��]   [structure diagram]
N����phenylethyl����naphthylamide   [��]   [structure diagram]
N����naphthoyl���aminohexan���ol   [��]   [structure diagram]
N����naphthoyl����methylbut���ylamine   [��]   [structure diagram]

(a) Literature reference for the separation that was reported in the literature.

Table 5: Some of the data that were used in the external validation. Some of the enantiomer pairs for which (R)-N-(3,5-dinitrobenzoyl)phenylglycine was both the first choice recommendation of the optimal rule-set and the chiral selector used in the separations reported in the literature.

Enantiomer Pair: Name   Ref(a)   Structure
benzoylmethyl ��tetradecylglycidate [��]
TDGA ester derivative   [structure diagram]
methyl ��hexylglycidate, TDGA analogue   [��]   [structure diagram]
methyl ���� ��dichlorophenoxy�propanoate   [��]   [structure diagram]
methyl �����methylphenoxy�propanoate   [��]   [structure diagram]
methyl �����naphthoxy�propanoate   [��]   [structure diagram]

(a) Literature reference for the separation that was reported in the literature.

Table 6: Some of the data that were used in the external validation. Some of the enantiomer pairs for which (R)-N-(3,5-dinitrobenzoyl)phenylglycine was both the first choice recommendation of the optimal rule-set and the chiral selector used in the separations reported in the literature.

Enantiomer Pair: Name   Ref(a)   Structure
N�������naphthyl��ethyl� � ��dichloroacetamide   [��]   [structure diagram]
N-acetyl �����naphthyl�ethylamine   [��]   [structure diagram]
N-chloroacetyl���aminoindane   [��]   [structure diagram]
N��acetylmethionine �N�����naphthyl�amide�   [��]   [structure diagram]
N�������naphthyl�ethyl� acetamide   [��]   [structure diagram]

(a) Literature reference for the separation that was reported in the literature.

Table 7: Some of the data that were used in the external validation. Some of the enantiomer pairs for which (R)-N-(3,5-dinitrobenzoyl)phenylglycine was both the first choice recommendation of the optimal rule-set and the chiral selector used in the separations reported in the literature.

Enantiomer Pair: Name   Ref(a)   Structure
N���CBZ�alanine (N-phenylamide)   [��]   [structure diagram]
N���BOC�alanine (N-� �-dimethylphenylamide)   [��]   [structure diagram]
N���FMOC�alanine (N-� �-dimethylphenylamide)   [��]   [structure diagram]

(a) Literature reference for the separation that was reported in the literature.

Table 8: Some of the data that were used in the external validation. The enantiomer pairs for which (S)-N-(3,5-dinitrobenzoyl)leucine was both the second choice recommendation of the optimal rule-set and the chiral selector used in the separations reported in the literature.

Enantiomer Pair: Name   Ref(a)   Structure
Metoprolol   [��]   [structure diagram]
Bepridil   [��]   [structure diagram]
Timolol   [��]   [structure diagram]

(a) Literature
reference for the separation that was reported in the literature.

Table 9: Some of the data that were used in the external validation. The enantiomer pairs for which the chiral selector used in the separations reported in the literature was neither the first nor the second choice recommendation of the optimal rule-set.

Enantiomer Pair: Name   Ref(a)   Structure
N���FMOC� ��benzoylglycine (N-phenylamide)   [��]   [structure diagram]
Labetolol   [��]   [structure diagram]

(a) Literature reference for the separation that was reported in the literature.

Table 10: Some of the data that were used in the external validation. The enantiomer pairs for which the optimal rule-set did not make any recommendations.

Chemical Feature             Name     Chemical Feature               Name
number of chiral centres     Cen      alkyl chain of length �        C
aliphatic OH                 Roh      alkyl chain of length �        Cc
aromatic OH                  Boh      alkyl chain of length �        Ccc
COOH                         Cooh     alkyl chain of length �        Cccc
ester                        Ester    alkyl chain of length � �      C���c
aldehyde                     Ald      alicyclic � membered ring      Rg�
ketone                       Ket      alicyclic membered ring        Rg
aliphatic amide              Rconh    alicyclic � membered ring      Rg�
aromatic amide               Bconh    other alicyclic ring           Rg
aliphatic amine              Rnh      aromatic membered ring         Bg
aromatic amine               Bnh      aromatic � membered ring       Bg�
nitro                        No�      other aromatic ring            Bg
cyanide/nitrile              Cn       bicyclic ring                  Bic
thio                         Rsr      tricyclic ring                 Tri
sulphinyl                    Rsor     polycyclic ring                Ply
sulphonyl                    Rso�r    hetero N                       Nhe
aliphatic X                  Rx       hetero O                       Ohe
aromatic X                   Bx       hetero S                       She
ether                        Ror      other hetero atom              �he
carbon carbon double bond    Cdbc

Figure 1: The chemical features of enantiomer pairs that were input to DataMariner and the names that were used for them.

�R� N ��� dinitrobenzoyl�phenylglycine rule �
IF
    rnh� � not�present OR � OR �
    no��� � not�present
    cdbc�� � not�present OR � OR �
    est� � not�present OR � OR �
    ket� � not�present OR � OR � OR �
    ald� � not�present
    nhe� � not�present OR � OR � OR � OR �
    rso�r�� � not�present
    she� � not�present
    cooh� � not�present OR � OR �
    rx� � not�present OR � OR � OR �
    cc�� � not�present OR �
        OR � OR � OR �
    bx� � not�present OR � OR � OR � OR � OR � OR OR �
    cen � � OR �
    c�� � not�present OR � OR � OR � OR � OR � OR
    cen � �
    rg��� � not�present OR � OR �
    cn� � not�present
THEN
    es�name � �R� N ��� dinitrobenzoyl�phenylglycine ��� ��
    es�name � �S� N ��� dinitrobenzoyl�leucine ������
    es�name � �R� N � �alpha naphthyl�ethylaminocarbonyl �S�indoline � carboxylic�acid ������

Figure 2: One of the rules induced during test

�R� N ��� dinitrobenzoyl�phenylglycine rule �
IF
    no��� � not�present
    boh� � � OR � OR � OR OR not�present
    ket� � � OR � OR � OR OR � OR � OR � OR OR not�present
    bx� � � OR OR � OR � OR � OR OR not�present
    rso�r�� � not�present
    she� � not�present
    cen � � OR �
    bg �� � � OR � OR � OR OR � OR � OR � OR OR not�present
    cen � �
    cdbc�� � not�present
    cn� � not�present
THEN
    es�name � �R� N ��� dinitrobenzoyl�phenylglycine ������
    es�name � �S� N ��� dinitrobenzoyl�leucine ������
    es�name � �R� N � �alpha naphthyl�ethylaminocarbonyl �S�indoline � carboxylic�acid ������

Figure 3: One of the rules induced during test �

�R� N ��� dinitrobenzoyl�phenylglycine rule �
IF
    no��� � not�present
    ket� � not�present OR �
    cooh� � not�present
    cdbc�� � not�present OR �
    boh� � not�present OR �
    rnh� � not�present
    ohe� � not�present OR �
    bic� �� �
    she� � not�present
    bx� ��
    bx� �� �
    cc�� �� �
    nhe� �� �
    c�� �� �
    cn� � not�present
    bx� ��
    rconh� � not�present
    c���c�� �� �
    bic� �� �
    cen � �
    bg��� �� not�present
    ror� �� �
    cc�� �� �
THEN
    es�name � �R� N ��� dinitrobenzoyl�phenylglycine ������

Figure 4: One of the rules induced during test ��

�R� N � �alpha naphthyl�ethylaminocarbonyl �S�indoline � carboxylic�acid rule �
IF
    bic� � � OR � OR OR � OR � OR � OR OR not�present
    est� � not�present
    nhe� � � OR � OR � OR OR not�present
    cooh� � not�present
    bconh� � not�present
    rconh� � � OR OR not�present
    cdbc�� � not�present
    tri� � not�present
    ply� � not�present
    ror� � � OR � OR � OR OR � OR � OR � OR OR not�present
    cen � �
THEN
    es�name � �R� N � �alpha naphthyl�ethylaminocarbonyl
        �S�indoline � carboxylic�acid ������

Figure 5: One of the rules induced during test ��

�R� N � �alpha naphthyl�ethylaminocarbonyl �S�indoline � carboxylic�acid rule �
IF
    bic� � � OR ��or� �bonds�away OR more�than� �bonds�away OR not�present
    est� � not�present
    nhe� � more�than� �bonds�away OR not�present
    cooh� � not�present
    bconh� � not�present
    rconh� � more�than� �bonds�away OR not�present
    cdbc�� � not�present
    tri� � not�present
    ply� � not�present
    ror� � � OR � OR ��or� �bonds�away OR more�than� �bonds�away OR not�present
    cen � �
THEN
    es�name � �R� N � �alpha naphthyl�ethylaminocarbonyl �S�indoline � carboxylic�acid ������

Figure 6: One of the rules induced during test ��
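
The conflict resolution strategy proposed in the discussion (try each rule in turn, then rank the CSP chiral selectors in the consequent of the first rule that fires by their associated probabilities) can be illustrated with a minimal sketch. This is not DataMariner code and is not taken from the induced rule-sets: the rule representation, the function names rule_fires and recommend, the attribute values, the selector names and the probabilities are all hypothetical and are used only to show the shape of the procedure.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation of an induced rule (an assumption, not DataMariner's format):
#   conditions: attribute name -> set of allowed values (one disjunctive clause each)
#   consequent: list of (CSP chiral selector, probability) pairs
Rule = Tuple[Dict[str, Set[str]], List[Tuple[str, float]]]


def rule_fires(conditions: Dict[str, Set[str]], example: Dict[str, str]) -> bool:
    """A rule fires only if every clause is satisfied by the example."""
    return all(example.get(attr) in allowed for attr, allowed in conditions.items())


def recommend(rules: List[Rule], example: Dict[str, str]) -> List[str]:
    """Return the selectors of the first rule that fires, highest probability first."""
    for conditions, consequent in rules:  # step 1: try each rule in turn
        if rule_fires(conditions, example):
            # steps 2 and 3: first choice, second choice, ... by probability
            ranked = sorted(consequent, key=lambda pair: pair[1], reverse=True)
            return [selector for selector, _ in ranked]
    return []  # no rule fired, so no recommendation is made


# Illustrative rules only; names and probabilities are invented for this sketch.
RULES: List[Rule] = [
    (
        {"rconh1": {"more_than_five_bonds_away", "not_present"},
         "cooh1": {"not_present"}},
        [("selector_B", 0.2), ("selector_A", 0.7), ("selector_C", 0.1)],
    ),
    (
        {"cooh1": {"at_the_centre_or_1"}},
        [("selector_C", 1.0)],
    ),
]

example = {"rconh1": "not_present", "cooh1": "not_present"}
print(recommend(RULES, example))  # ['selector_A', 'selector_B', 'selector_C']
```

An empty list corresponds to the case in which the rule-set makes no recommendation. Returning the whole ranked list, rather than only the first choice, leaves the analyst free to pick from it using other criteria such as cost or availability.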