A Rule-Based System for Test Quality Improvement

Gennaro Costagliola, Vittorio Fuccella
Dipartimento di Matematica e Informatica, Università degli Studi di Salerno, Fisciano (SA), Italy.

ABSTRACT

To correctly evaluate learners' knowledge, it is important to administer tests composed of good quality question items. By the term "quality" we mean the potential of an item to effectively discriminate between skilled and untrained students and to match the difficulty level desired by the tutor. This paper presents a rule-based e-testing system which assists tutors in obtaining better question items through subsequent test sessions. After each test session, the system automatically assesses item quality and provides the tutors with advice about what to do with each item: good items can be re-used for future tests; among items with lower performance, some should be discarded, while others can be modified and then re-used. The proposed system has been evaluated in a course at the University of Salerno.

Keywords: e-Testing, Computer Aided Assessment, CAA, item, item quality, questions, eWorkbook, Item Response Theory, IRT, Item Analysis, online testing, multiple choice test.

1. INTRODUCTION

E-testing, also known as Computer Assisted Assessment (CAA), is a sector of e-learning aimed at assessing learners' knowledge through computers. Through e-testing, tests composed of several question types can be presented to the students in order to assess their knowledge. The multiple choice question type is frequently employed since, among other advantages, a large number of tests based on it can easily be corrected automatically. The experience gained by educators and the results obtained through several experiments (Woodford & Bancroft, 2005) provide some guidelines for writing good multiple choice questions (items, in the sequel), such as: "use the right language", "avoid a large number of implausible distractors for an item", etc. It is also possible to evaluate the effectiveness of the items through the use of several statistical models, such as Item Analysis (IA, 2008) and Item Response Theory (IRT). Both of them are based on the interpretation of statistical indicators calculated on test outcomes. The most important indicators are the difficulty indicator, which measures the difficulty of an item, and the discrimination indicator, which measures how effectively an item discriminates between skilled and untrained students. Further statistical indicators are related to the distractors (wrong options) of an item. A good quality item has a high discrimination potential and a difficulty level close to the one desired by the tutor.

Despite the availability of guidelines for writing good items and of statistical models to analyze their quality, only a few tutors are aware of the guidelines and even fewer are familiar with statistics. The result is that the quality of the tests used for exams or admissions is sometimes poor and in some cases could be improved. The most common Web-based e-learning platforms, such as Moodle (Moodle, 2008), Blackboard (Blackboard, 2008), and Questionmark (Questionmark, 2008), evaluate item quality by generating and showing item statistics. Nevertheless, their interpretation is left to the tutors: these systems do not advise or help the tutor in improving items. In this paper we propose an approach and a system for improving items: we provide tutors with feedback on item quality and suggest the appropriate action to take to improve it.
To elaborate, the approach consists of administering tests to learners through a suitable rule-based system. The system obtains item quality improvement by analyzing the test outcomes. After the analysis, the system provides the tutor with one of the following suggestions:

• "Keep on using the item" in future test sessions, for good items;
• "Discard the item", for poor items;
• "Modify the item", for poor items whose defect originates from a well-known cause. In this case, the system also provides the tutor with suggestions on how to modify the item.

Though item quality can already be improved after the first test session in which an item is used, the system can be employed across subsequent test sessions, obtaining further improvements.

Rule-based systems are generally composed of an inferential engine, a knowledge-base and a user interface. Our system follows this model. The inferential engine works by exploiting fuzzy classification: the items are classified on the basis of the values of some parameters calculated on test outcomes. Fuzzy classification has been successfully employed in technological applications in several sectors, from weather forecasting (Bradley et al., 1982) to medical diagnosis (Exarchos et al., 2007). In our system, it has been preferred over other frequently used classification methods, such as decision trees and Bayesian classifiers, for the following reasons:

• Knowledge availability. Most of the knowledge is already available, as witnessed by the presence of numerous theories and manuals on psychometrics.
• Lack of data. Other types of classification, based on data, would require the availability of large data sets. Once such data have been gathered, in quantities sufficient to form statistically significant classes, such methods might be exploited.

The knowledge-base of the system has been inferred from IA and other statistical models for the evaluation of items. The system has been given a Web-based interface. Rather than developing it from scratch, we have preferred to integrate the system into an existing Web-based e-testing platform: eWorkbook (Costagliola et al., 2007), developed at the University of Salerno. An experiment on the system's performance has been carried out in a course at the University of Salerno. As shown by the experiment, we can obtain items which better discriminate between skilled and untrained students and better match the difficulty estimated by the tutor.

The paper is organized as follows: section 2 presents a brief survey on fuzzy classification; section 3 describes the statistical models for evaluating the effectiveness of the items; the approach for item quality improvement is presented in section 4. In section 5, we describe the system: its architecture and its instantiation in the existing e-testing platform; section 6 presents an experiment and a discussion of its results; section 7 contains a comparison with related work; lastly, several final remarks and a discussion on future work conclude the paper.

2. FUZZY CLASSIFICATION

The approach presented in this paper employs a fuzzy classification method. Classification is one of the most widespread Data Mining techniques (Roiger & Geatz, 2004). It consists of grouping n entities of a given knowledge domain into m knowledge containers, often called classes, sections, categories, etc. To perform a classification, several attributes of the entities must be analyzed. These are called input attributes. The class in which an entity will be placed is an output attribute.
A good classification consists of classes with high internal cohesion and external separation. Classification differs from clustering: the difference lies in the final classes, which are predefined only for the former problem. In clustering, instead, the classes (clusters) are discovered during the process. For this reason, classification is said to be a supervised process. Classification has been employed in several fields for solving real problems, such as:

• in medicine, for medical diagnosis;
• in pattern discovery, for fraud detection; e.g., the FALCON system (Brachman et al., 1996), created by HNC Inc., is used for detecting possibly fraudulent credit card transactions;
• in economics and finance, for risk management, e.g. for classifying the credit risk of a person who has requested funds.

Several methods can be used for classification. Some of them, such as decision trees, use machine learning to extract knowledge from data. The most frequently used machine learning approaches divide the data into two sets: the training set and the test set. The former is used to produce the knowledge, the latter to test the effectiveness of the approach. The decision tree lends itself well to classification, but it gives just one output categorical attribute. Furthermore, the decision tree produces results that are particularly easy to explain and can be suitable when the data distribution is unknown. Nevertheless, it can be advisable to employ other methods, such as the Bayesian classifier, when all or most of the input attributes are numerical: the tree could have too many conditional tests to satisfy to be informative.

When data are lacking and the knowledge is already available, a rule-based system is a suitable solution for classification. A rule-based system is a system whose knowledge-base is expressed in the form of production rules. Rule-based systems have been employed in many applications for decision making. Such systems can also be used for classification. The production rules can be inferred directly from expert knowledge or obtained through machine learning methods. In general, the rules are in the following form:

IF <antecedent conditions> THEN <consequent conditions>

The antecedent conditions define the values or the value intervals for one or more input attributes. The consequent conditions define the values or the value intervals for one or more output attributes. In the case of classification, the consequent conditions determine whether a given entity belongs to a class.

In rule-based systems, it is often necessary to deal with uncertainty. To this aim, fuzzy logic is often employed; e.g., it has been used for economic performance analysis (Zhou et al., 2005). Fuzzy logic is derived from fuzzy set theory. Fuzzy sets were first introduced by Zadeh (1977), and have been applied in various fields, such as decision making and control (Bardossy & Duckstein, 1995). Fuzzy set theory deals with reasoning that is approximate rather than precisely deduced from classical predicate logic. A fuzzy set is characterized by a membership function which maps a value that might be a member of the set to a number between zero and one indicating its actual degree of membership. The triangular membership function is the most frequently used and the most practical, but other shapes, continuous or discrete, are also used. A variable used in a fuzzy production rule is also called a linguistic variable and is associated with a linguistic value (term). Each linguistic value is associated with a fuzzy set.
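As an illustration of the membership functions just described, the following minimal Java sketch (our own illustration, not taken from the system; parameter values are invented) evaluates a triangular and a trapezoidal membership function:

```java
// Minimal sketch of fuzzy membership functions (illustrative only).
public final class Membership {

    // Triangular membership function with feet at a and c and peak at b.
    public static double triangular(double x, double a, double b, double c) {
        if (x <= a || x >= c) return 0.0;
        return x <= b ? (x - a) / (b - a) : (c - x) / (c - b);
    }

    // Trapezoidal membership function with feet at a and d and plateau [b, c].
    public static double trapezoidal(double x, double a, double b, double c, double d) {
        if (x <= a || x >= d) return 0.0;
        if (x >= b && x <= c) return 1.0;
        return x < b ? (x - a) / (b - a) : (d - x) / (d - c);
    }

    public static void main(String[] args) {
        // Degree to which a discrimination of 0.35 belongs to a hypothetical "high" set
        // with feet at 0.2 and 1.0 and peak at 0.5 (values chosen only for illustration).
        System.out.println(triangular(0.35, 0.2, 0.5, 1.0));   // ≈ 0.5
    }
}
```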
A fuzzy system is a set of fuzzy rules connecting fuzzy input and fuzzy output in the form of IF-THEN sentences. Once we have the rules and the fuzzy sets defining the values of the linguistic variables, fuzzy inference can be applied. The most commonly applied method is the four-phase procedure introduced by Mamdani & Assilian (1999). The four steps are the following:

• Fuzzyfication: conversion of the input values into the corresponding membership levels in each fuzzy set;
• Inference: the membership levels are combined in order to obtain a degree of fulfillment for each rule;
• Combination: combination of all the values obtained for the rules into a unique fuzzy set;
• Defuzzyfication: conversion of the fuzzy set obtained at the previous phase into a single value.

A fuzzy classifier is a function that associates with each entity, for each output class, a Degree of Fulfillment (DoF, briefly) expressing the possibility that the entity belongs to that class. The fuzzy classifier produces a categorical value as final output. Often, the classification is performed by selecting the class for which the DoF is the highest. This method corresponds to using the maximum method for combination and the maximum method for defuzzyfication.

3. ITEM QUALITY: ITEM AND DISTRACTOR ANALYSIS

This section describes the main statistical models on which the knowledge-base of our system is based. In particular, it focuses on IA, whose statistical indicators are used in our system's rules. The tests administered through our system make use of multiple choice items for the assessment of learners' knowledge. Such items are composed of a stem and a list of options. The stem is the text that states the question. The only correct answer is called the key, whilst the incorrect answers are called distractors (Woodford & Bancroft, 2005).

As mentioned in the introduction, two main statistical models are available for evaluating item quality: IA and IRT. Although today IRT is the predominant measurement model, IA is still frequently employed by psychometricians, test developers, and tutors for a number of reasons. First, the concepts of IA are simpler than those of their IRT counterparts: even tutors without a strong statistical background can easily interpret the results without going through a steep learning curve. Second, IA can be computed by many popular statistical software programs, including SAS, while IRT requires specialized software packages such as Bilog, Winsteps, Multilog, or RUMM (Yu & Wong, 2003; Yu, 2005). One great advantage of IRT is the invariance of ability and item parameters: it is the cornerstone of IRT and the major distinction between IRT and IA (Hambleton & Swaminathan, 1985). One drawback of IRT, however, is that a large sample size is necessary for the estimation of parameters. Nevertheless, empirical studies examining and/or comparing the invariance characteristics of item statistics from the two measurement frameworks have observed that it is difficult to find a great invariance or any other obvious advantage in the IRT-based item indicators (Stage, 1999). For our study, IA has been preferred over IRT for the following main reasons: it needs a smaller sample size for obtaining statistically significant indicators, and it is easier to use IA indicators to compose rule conditions.
The following statistical indicators are available from IA and other models, such as distractor analysis:

• difficulty: a real number between 0 and 1 which expresses a measure of the difficulty of the item, intended as the proportion of learners who do not answer the item correctly.
• discrimination: a real number between -1 and 1 which expresses a measure of how well the item discriminates between good and bad learners. Discrimination is calculated as the point biserial correlation coefficient between the score obtained on the item and the total score obtained on the test. The point biserial is a measure of association between a continuous variable (e.g. the score on the test) and a binary variable (e.g. the score on a multiple choice item).
• frequency(i): a real number between 0 and 1 which expresses the frequency of the i-th option of the item. Its value is calculated as the percentage of learners who chose the i-th option.
• discrimination(i): a real number between -1 and 1 which expresses the discrimination of the i-th option. Its value is calculated as the point biserial correlation coefficient between the result obtained by the learner on the whole test and a dichotomous variable that states whether the i-th option was chosen by the learner (yes=1, no=0) or not.
• abstained_freq: a real number between 0 and 1 which expresses the frequency of abstention (no answer given) on the item. Its value is calculated as the percentage of learners who did not give any answer to the item, where abstention is allowed.
• abstained_discr: a real number between -1 and 1 which expresses the discrimination of the abstention on the item. Its value is calculated as the point biserial correlation coefficient between the result obtained by the learner on the whole test and a dichotomous variable that states whether the learner abstained (yes=1, no=0) on the item or not.

Discrimination and difficulty are the most important indicators. They can be used both for determining item quality and for choosing the advice given to tutors. As experts suggest (Massey, 2007), a good value for discrimination is about 0.5. A positive value lower than 0.2 indicates that the item does not discriminate well. This can be due to several reasons, including: the question does not assess learners on the desired knowledge; the stem or the options are badly or ambiguously expressed; etc. It is usually difficult to understand what is wrong with these items and even more difficult to provide a suggestion to improve them; so, if the tutor cannot identify the problem, the suggestion is to discard the item. A negative value for discrimination, especially if joined with a positive value for the discrimination of a distractor, is a sign of a possible mistake in choosing the key (a data entry error occurred). In this case it is easy to recover the item by changing the key.

If difficulty is too high (>0.85) or too low (<0.15), there is the possibility that the item does not correctly evaluate the learners on the desired knowledge or subject. This is particularly true when such values for difficulty are observed together with medium-low values for discrimination. Furthermore, our system allows the tutor to define the foreseen difficulty for an item.
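As an illustration of how the two key indicators defined above can be derived from raw test outcomes, the following Java sketch (ours, with invented data for six learners) computes the difficulty of one item as the proportion of wrong answers and its discrimination as the point biserial correlation between the 0/1 item score and the total test score:

```java
// Illustrative computation of the IA difficulty and discrimination indicators for one item.
public final class ItemIndicators {

    // Point biserial correlation = Pearson correlation between a 0/1 variable and a continuous one.
    static double pointBiserial(int[] itemScore, double[] totalScore) {
        int n = itemScore.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += itemScore[i]; meanY += totalScore[i]; }
        meanX /= n; meanY /= n;
        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < n; i++) {
            double dx = itemScore[i] - meanX, dy = totalScore[i] - meanY;
            cov += dx * dy; varX += dx * dx; varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // Hypothetical outcomes: 1 = item answered correctly, 0 = otherwise.
        int[] itemScore     = {1, 1, 0, 1, 0, 0};
        double[] totalScore = {22, 25, 14, 20, 12, 16};   // total test scores of the same learners

        double correct = 0;
        for (int s : itemScore) correct += s;
        double difficulty = 1.0 - correct / itemScore.length;   // proportion of wrong answers

        System.out.println("difficulty = " + difficulty);                          // 0.5
        System.out.println("discrimination = " + pointBiserial(itemScore, totalScore));
    }
}
```

In this toy data set the learners with the highest total scores answered the item correctly, so the discrimination comes out positive, which is the behaviour expected of a sound item.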
In a test, in order to better assess a heterogeneous class with different levels of knowledge, it is important to balance the difficulty of the items: for example, in the preparation of the Michigan Educational Assessment Program (MEAP, 2007), "easy" and "difficult" items are used in every form to balance the difficulty level of the items. Having a precise estimation of an item's difficulty allows the tutor to correctly assign it to a test section on the basis of its difficulty when composing tests. Thus, the closer a tutor's estimation of item difficulty is to the actual calculated difficulty for that item, the more reliable that item is considered to be. When the difficulty is too high, or is underestimated, this can be due to the presence of a distractor (noticeable for its high frequency) which is too plausible: it tends to mislead many students, even skilled ones. Removing or substituting that distractor can help in obtaining a better item. Sometimes the item has its own intrinsic difficulty and it can be difficult to adjust it, so the suggestion can be to modify the tutor's estimation.

As for distractors, they can contribute to form a good item when they are selected by a significant number of students. When the frequency of a distractor is too high, there could be an ambiguity in the formulation of the stem or of the distractor. A good indicator of a distractor's quality is its discrimination, which should be negative, denoting that the distractor was mainly selected by untrained students. In conclusion, a good distractor is one which is selected by a small but significant number of untrained students. High abstention is always a symptom of high difficulty for the item. When it is accompanied by a high (non-negative and not close to 0) value for its discrimination and a low value for item discrimination, it can indicate that the question is of poor quality and difficult to improve.

4. THE APPROACH

The approach consists of administering tests to the learners through a suitable e-testing system. On the basis of the test outcomes, the system evaluates the items and suggests to the tutor the most suitable action to undertake on each of them. This is possible after a test has been administered to a statistically significant number of learners. In general, the quality improvement is obtained in two ways:

• through the increase of item discrimination. This objective is pursued by both eliminating and opportunely modifying items with low discrimination;
• by bringing the tutor's estimation of the difficulty closer to the calculated difficulty of the item. In the most desirable cases (when possible), the system suggests how to modify the item. Otherwise, the estimation must be modified.

Though item quality can be improved after the first test session in which an item is used, the items can be evaluated by the system through subsequent test sessions, following the lifecycle shown in Figure 1. The figure shows a UML activity diagram, in which the role of the tutor and the role of the system are specified in two different swimlanes.

Figure 1. Item Lifecycle

The item starts its lifecycle when it is created by the tutor. Then, the tutor selects the item for a test session. The item is administered to the learners through the system as part of a test. At the end of the session, the system stores the learners' outcomes. Such outcomes are used to calculate statistical indicators, which are used in the production rules for item evaluation.
The output of the evaluation is the state of the item, whose value is expressed through a traffic light. Later on, according to the system output, the tutor decides the destiny of the item as follows:

• State = Green: the item performs well and can be re-used for future test sessions.
• State = Red: the item performs badly and should be discarded.
• State = Yellow: the item performs badly, but its quality can be improved. The system suggests how. The item is modified by the tutor and can be re-used for future test sessions.

It is worth noting that the system merely suggests to the tutor the most suitable action. Figure 1 shows the case in which the tutor follows the suggestion of the system. Nevertheless, the tutor can choose not to follow the system's suggestion if s/he deems it opportune.

5. THE SYSTEM

Typically, rule-based systems are composed of an inferential engine, a knowledge-base and a user interface (Momoh et al., 2000). Our system follows this model. The knowledge-base has been mostly inferred by translating into rules the verbal knowledge presented in section 3. Since such knowledge does not completely cover all of the aspects considered in our system, it has been integrated with knowledge extracted from data. The inferential engine works by performing a classification of the items. Several classes of items have been identified, and each class is associated with a production rule. Fuzzy sets have been used in order to cope with the linguistic uncertainty contained in the rules: sources of uncertainty in our system are associated both with the conditions in the antecedents of the rules and with the combination of the rules themselves. The DoF of a rule expresses the degree of membership of the item in the corresponding class. The classification is performed by selecting the class for which the DoF is the highest. This model fits our question item classification problem well, since, in most cases (except for Class 1, see Table 3), membership in a class indicates the presence of a defect affecting the item. By classifying the item into the class to which it belongs with the maximum degree, a decision is taken according to the most severe problem affecting the item.

The system has been equipped with a Web-based user interface. Rather than developing it from scratch, we have preferred to integrate the system into an existing Web-based e-testing platform. To elaborate, the system has been implemented as a Java Object Oriented framework, called the Item Quality Framework, which can be instantiated in any Java-based e-testing platform. Our choice fell on eWorkbook, already in use at our faculty.

The Knowledge-Base

This section describes the process for obtaining the fuzzy production rules from the knowledge. As already pointed out, the rules have been mostly inferred from the verbal knowledge presented in section 3 and integrated with knowledge extracted from data. The integration has only been necessary for modeling a few membership functions.

Variables and Fuzzyfication

The set of variables used is reported in Table 1, together with an explanation of their meaning and the set of possible values (terms) they can assume. These variables are directly chosen from the statistical indicators presented in section 3 or derived from them. The discrimination and difficulty variables are the same indicators for item discrimination and difficulty defined in section 3. The same holds for the variables related to abstention, abst_frequency and abst_discrimination.
difficulty_gap is a variable representing the error in the tutor's estimation of item difficulty: through the system interface, the tutor can assign one out of three difficulty levels to an item (easy = 0.3; medium = 0.5; difficult = 0.7). difficulty_gap is calculated as the difference between the tutor's estimation and the actual difficulty calculated by the system. Three variables representing the frequency of the distractors of an item have been considered: max_distr_freq, min_distr_freq and distr_freq. Their value is not an absolute frequency, but is relative to the frequency of the other distractors: it is obtained by dividing the absolute frequency by the mean frequency of the distractors of the item. In the case of items with five options, such as those on which our system has been tested, their value is a real number varying from 0 to 4.

Table 1. Variables and Terms

Variable            | Explanation                                                                                                         | Terms
discrimination      | Item's discrimination (see sec. 3)                                                                                  | Negative, low, high
difficulty          | Item's difficulty (see sec. 3)                                                                                      | Very_low, medium, very_high
difficulty_gap      | The difference between the tutor's estimation of the item's difficulty and the difficulty calculated by the system | Underestimated, correct, overestimated
max_distr_discr     | The maximum discrimination for the distractors of an item                                                           | Negative, positive
max_distr_freq      | The maximum (relative) frequency for the distractors of an item                                                     | Low, high
min_distr_freq      | The minimum (relative) frequency for the distractors of an item                                                     | Low, high
distr_freq          | The (relative) frequency of the distractor with maximum discrimination for an item                                  | Low, high
abst_frequency      | The frequency of the abstentions for an item                                                                        | Low, high
abst_discrimination | The discrimination of the abstentions for an item                                                                   | Negative, positive

Membership Functions

As for the membership functions of the fuzzy sets associated with each term, triangular and trapezoidal shapes have been used. Most of the values for the bases and the peaks have been established using expert knowledge. Only for some variables have the membership functions been defined on an experimental basis. While we already had clear ideas on how to define most of them, we did not have enough information from the knowledge on how to model the membership functions for the variables related to abstention (abst_frequency and abst_discrimination). A calibration phase was required in order to refine the values for the bases and peaks of their membership functions. As a calibration set, test results from the 2006 Science Faculty Admission Test were used. The calibration set was composed of 64 items with 5 options each. For each item, about one thousand records (students' answers) were available, even if only a smaller random sample was considered. Test items and their results were inspected by a human expert who identified items which should have been discarded due to low discrimination and anomalous values for the variables related to abstention. We found 5 items satisfying the conditions above: the mean values for abst_discrimination and abst_frequency were, respectively, 0.12 and 0.39, as shown in Table 2. Due to the limited size of the calibration set, the simple method of choosing the peaks of the functions at the mean value, as shown in (Bardossy & Duckstein, 1995), has been used. When more data become available, a more sophisticated method will be used for the definition of the membership functions, such as the one proposed in (Civanlar & Trussel, 1986). Charts of the membership functions are shown in Figure 2.
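To make the derived variables of Table 1 concrete, the following Java sketch (ours, with invented values for a five-option item; the sign convention for difficulty_gap follows the definition above, so negative values mean the tutor underestimated the difficulty) shows how difficulty_gap and the relative distractor frequencies can be computed:

```java
// Illustrative computation of the derived fuzzy input variables of Table 1 (hypothetical data).
public final class DerivedVariables {

    public static void main(String[] args) {
        double difficulty = 0.79;          // difficulty calculated by the system
        double tutorDifficulty = 0.5;      // tutor's estimation (medium)
        double[] distractorFreq = {0.43, 0.05, 0.07, 0.02};   // absolute frequencies of the 4 distractors

        // difficulty_gap: tutor's estimation minus calculated difficulty.
        double difficultyGap = tutorDifficulty - difficulty;   // ≈ -0.29, i.e. underestimated

        // Relative frequencies: absolute frequency divided by the mean distractor frequency.
        double mean = 0;
        for (double f : distractorFreq) mean += f;
        mean /= distractorFreq.length;                          // 0.1425
        double maxDistrFreq = 0, minDistrFreq = Double.MAX_VALUE;
        for (double f : distractorFreq) {
            maxDistrFreq = Math.max(maxDistrFreq, f / mean);
            minDistrFreq = Math.min(minDistrFreq, f / mean);
        }
        System.out.println("difficulty_gap = " + difficultyGap);
        System.out.println("max_distr_freq = " + maxDistrFreq);   // ≈ 3.02, i.e. "high"
        System.out.println("min_distr_freq = " + minDistrFreq);   // ≈ 0.14, i.e. "low"
    }
}
```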
Table 2. Anomalous values for variables related to abstention.

Question Id | abst_discrimination | abst_frequency
23          | 0.03                | 0.26
29          | 0.10                | 0.53
33          | 0.14                | 0.42
34          | 0.18                | 0.32
61          | 0.17                | 0.42
Mean        | 0.12                | 0.39

Figure 2. Membership Functions

Rules

From the verbal description of the knowledge presented in section 3, the rules summarized in Table 3 have been inferred. The first three columns in the table contain, respectively, the class of the item, the rule used for classification and the item state. For items whose state is yellow, the fourth column contains the problem affecting the item and the suggestion to improve its quality. Conditions in the rules are connected using the AND and OR logic operators. The commonly used min-max inference method has been used to establish the degree of fulfillment of the rules. All the rules were given the default weight (1.0), except for the first one (0.9). By modifying the weight of the first rule, we can tune the sensitivity of the system: the lower this value, the higher the probability that anomalies will be detected in the items. Some suggestions in the last column advise performing an operation on a distractor. The distractor to modify or eliminate (in the case of rules 4, 7 and 10) or to select as the correct answer (rule 9) is signaled by the system. An output variable x has been added to the system to keep the identifier of that distractor. It is worth noting that the most important IA statistical indicators have been employed more frequently than the others. For example, discrimination, which is a good indicator of the overall quality of an item, is present in 8 rules out of 10, while a more specific indicator, such as distractor discrimination, has only been employed in 2 rules.

Table 3. Rules

Class | Rule                                                                                   | State  | Problem and Suggestion
1     | discrimination IS high AND abst_discrimination IS negative WITH 0.9                   | Green  | /
2     | discrimination IS low AND abst_frequency IS high AND abst_discrimination IS positive  | Red    | /
3     | difficulty IS very_low AND discrimination IS low                                      | Red    | /
4     | difficulty IS very_high AND discrimination IS low AND max_distr_freq IS high          | Yellow | Item too difficult due to a too plausible distractor: delete or substitute distractor x.
5     | difficulty_gap IS overestimated AND discrimination IS low                             | Yellow | Item difficulty overestimated: avoid too plausible distractors and too obvious answers.
6     | difficulty_gap IS overestimated AND discrimination IS NOT low                         | Yellow | Item difficulty overestimated: modify the estimated difficulty.
7     | difficulty_gap IS underestimated AND max_distr_freq IS high                           | Yellow | Item difficulty underestimated due to a too plausible distractor: delete or substitute distractor x.
8     | difficulty_gap IS underestimated AND max_distr_freq IS NOT high                       | Yellow | Item difficulty underestimated: modify the estimated difficulty.
9     | max_distr_discr IS positive AND discrimination IS negative                            | Yellow | Wrong key (data entry error): select option x as the correct answer.
10    | discrimination IS high AND max_distr_discr IS positive AND distr_freq IS NOT low      | Yellow | Too plausible distractor: delete or substitute distractor x.

The Inferential Engine

The inferential engine performs a process composed of the following steps:

1. Obtaining input data from the e-testing platform;
2. Construction of the item data matrix;
3. Item classification;
4. Giving output to the e-testing platform.

In step 1, data are obtained from the e-testing platform in which the Item Quality Framework is instantiated. This operation required the development of a wrapper to access the e-testing platform database.
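Before moving on to the remaining steps, the following Java sketch (ours, with invented membership values, and assuming, as is common in fuzzy rule systems, that the WITH weight multiplies the rule's activation degree) illustrates how the min-max method and the weight of rule 1 interact: AND conditions are combined with the minimum, the result of rule 1 is scaled by 0.9, and the item is assigned to the class with the highest DoF, so lowering the weight of rule 1 makes the "anomaly" classes more likely to win.

```java
// Illustration of min-max rule evaluation with a weighted rule (all values are hypothetical).
public final class WeightedRuleExample {

    public static void main(String[] args) {
        // Membership levels produced by fuzzyfication for one item.
        double discrIsHigh        = 0.60;
        double abstDiscrIsNeg     = 0.80;
        double discrIsLow         = 0.55;
        double difficultyVeryHigh = 0.70;
        double maxDistrFreqHigh   = 0.65;

        // Rule 1 (Green): AND -> minimum, then scaled by the rule weight 0.9.
        double dofRule1 = 0.9 * Math.min(discrIsHigh, abstDiscrIsNeg);                              // 0.54

        // Rule 4 (Yellow, too plausible distractor): AND of three conditions.
        double dofRule4 = Math.min(discrIsLow, Math.min(difficultyVeryHigh, maxDistrFreqHigh));      // 0.55

        // Maximum method: the item is classified in the class with the highest DoF.
        String state = dofRule4 > dofRule1 ? "Yellow (rule 4)" : "Green (rule 1)";
        System.out.println(state);   // with these values rule 4 wins; with weight 1.0 rule 1 would win
    }
}
```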
The input data obtained at the previous step are used in step 2 for the construction of the item data matrix, which reports, for each item, the value of the following attributes:

• N: number of options;
• key: the index of the right option;
• discrimination: item discrimination;
• difficulty: item difficulty;
• tutor_difficulty: tutor's estimation of item difficulty;
• discrimination(1); …; discrimination(N): N columns containing the discrimination of each option;
• difficulty(1); …; difficulty(N): N columns containing the difficulty of each option;
• abstained_discr: discrimination of the abstention on the item;
• abstained_freq: frequency of the abstention on the item.

Item classification is performed, at step 3, by firing the rules. Before the rules can be fired, their variables must be assigned values directly taken from the item data matrix (e.g. discrimination, difficulty, etc.) or derived from them (e.g. difficulty_gap, max_distr_freq, etc.). Then, the rules are fired and a new matrix containing the DoF for each item and for each class is obtained. As stated before, the item is classified in the class with the maximum DoF. Lastly, at step 4, the output, with item state, problem and suggestion, is passed to the e-testing platform.

System Implementation and Interface

The system was implemented in two phases:

1. Development of the Item Quality Framework;
2. Its instantiation in an existing Web-based e-testing platform, called eWorkbook.

The Item Quality Framework

The system has been implemented as a Java Object Oriented framework. In this way, it can easily be integrated into any Java-based e-testing platform. The Item Quality Framework offers the following functionalities:

• it implements the inferential engine;
• it provides an Application Programming Interface (API) for both the construction of the item data matrix and the access to output data.

For the development of the inferential engine, a free Java library implementing a complete fuzzy inference system, called jFuzzyLogic (jFuzzyLogic, 2008), has been used. The system variables, the fuzzyfication and inference methods, and the rules have been defined using the Fuzzy Control Language (FCL, 1997), supported by the jFuzzyLogic library. The advantage of this approach, compared to a hard-coded solution, is that membership functions and rules can simply be changed by editing a configuration file, thus avoiding rebuilding the system. Data can be imported from various sources and exported to several formats, such as spreadsheets or relational databases. The data matrix and the results can be saved in persistent tables, in order to avoid performing the calculations every time they must be visualized. The API is composed of two different Java classes, which allow input to and output from the Inferential Engine, respectively. The former contains methods for adding rows to the item data matrix. The latter contains methods for obtaining the state of an item (green, yellow, red) and, in the case of a yellow state, the suggestion for improving the item quality. It is worth noting that the suggestions can be internationalized, that is, they can easily be translated into any language by editing a text file.

Instantiation in eWorkbook

Figure 3. eWorkbook Architecture (after the instantiation of the Item Quality Framework)

eWorkbook is a Web-based e-testing platform that can be used for evaluating learners' knowledge through on-line tests, created by tutors and taken by learners, based on the multiple choice question type.
The questions are kept in a hierarchical repository. The tests are composed of one or more sections. There are two kinds of sections: static and dynamic. The difference between them is in the way they allow question selection: for a static section, the questions are chosen by the tutor; for a dynamic section, some selection parameters must be specified, such as the difficulty, leaving the platform to choose the questions randomly whenever a learner takes a test. In this way, it is possible with eWorkbook to build a test from banks of items of different difficulties, thus balancing test difficulty in order to better assess a heterogeneous set of students.

As shown in Figure 3, eWorkbook has a layered architecture. The Jakarta Struts framework (Struts, 2008) has been used to support the Model 2 design paradigm, a variation of the classic Model View Controller (MVC) approach. In our design, Struts works with JSP for the View, while it interacts with Hibernate (Hibernate, 2008), a powerful framework for object/relational persistence and query services for Java, for the Model. The application is fully accessible with a Web browser. No browser plug-in installations are needed, since its pages are composed of standard HTML and ECMAScript (ECMAScript, 2008) code. The Web browser interacts with the Struts Servlet, at the Controller Layer, which processes the request and dispatches it to the Action Class responsible for serving it, according to the predefined configuration. It is worth noting that the Struts Servlet uses the JSP pages to implement the user interfaces. The Action Classes interact with the modules of the Business Layer, responsible for the logic of the application. At this layer, the functionalities of the system are implemented in four main sub-systems:

• User Management Subsystem (UMS), responsible for user management. In particular, it provides insert, update and delete facilities.
• Question Management Subsystem (QMS), which manages eWorkbook's question repository and controls access to it.
• Test Management Subsystem (TMS), which manages eWorkbook's test repository.
• Course Management Subsystem (CMS), responsible for course management. In particular, it allows the insertion, update and deletion of a course.

The Business Layer accesses the Data Layer, implemented through a Relational Data Base Management System (RDBMS), to persist the data through the functionalities provided by the Hibernate framework. The integration of the new functionalities into eWorkbook has required the development and integration of new modules at all the layers. In particular, a new sub-system, called the Item Quality Sub-System (IQS), responsible for instantiating the framework and providing input, output and visualization functionalities, has been added at the Business Layer. Further minor modules have been added at the other layers: the input of data is performed by a wrapper module that reads data from eWorkbook's database and calls the API to fill the data matrix of the framework; the interface for browsing the item repository in eWorkbook has been modified in order to show item performance (difficulty and discrimination) and state (green, yellow or red). In this way, defective items are immediately visible to the tutor, who can undertake the appropriate actions (delete or modify). A screenshot of the item report is shown in Figure 4a.
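To give an idea of how the IQS wrapper might interact with the two API classes of the Item Quality Framework described above, here is a minimal Java sketch. The paper describes the API only verbally, so every class and method name below (ItemDataMatrixInput, addItemRow, ItemQualityOutput, getState, getSuggestion) is a hypothetical placeholder, and the result is hard-coded instead of being produced by the real inferential engine; only the difficulty (0.79) and tutor estimation (0.5) echo the 1-F-4 example discussed in section 6.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (NOT the real API): how a wrapper might feed the framework and read results.
public final class IqsWrapperSketch {

    enum State { GREEN, YELLOW, RED }

    // Stand-in for the framework's input class: collects one row of the item data matrix per item.
    static final class ItemDataMatrixInput {
        final Map<String, double[]> rows = new HashMap<>();
        void addItemRow(String itemId, double discrimination, double difficulty, double tutorDifficulty) {
            rows.put(itemId, new double[]{discrimination, difficulty, tutorDifficulty});
        }
    }

    // Stand-in for the framework's output class: exposes state and suggestion per item.
    static final class ItemQualityOutput {
        final Map<String, State> states = new HashMap<>();
        final Map<String, String> suggestions = new HashMap<>();
        State getState(String itemId) { return states.get(itemId); }
        String getSuggestion(String itemId) { return suggestions.get(itemId); }
    }

    public static void main(String[] args) {
        ItemDataMatrixInput input = new ItemDataMatrixInput();
        input.addItemRow("1-F-4", 0.12, 0.79, 0.5);   // discrimination value is invented

        // In the real framework the inferential engine would fire the fuzzy rules here;
        // a plausible result is hard-coded only to show how the output side is consumed.
        ItemQualityOutput output = new ItemQualityOutput();
        output.states.put("1-F-4", State.YELLOW);
        output.suggestions.put("1-F-4", "Item difficulty underestimated due to a too plausible distractor.");

        System.out.println(output.getState("1-F-4") + ": " + output.getSuggestion("1-F-4"));
    }
}
```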
Furthermore, the platform has been given a versioning functionality: once an item is modified, a newer version of it is generated, keeping the old data in the question repository. Through this functionality, the tutor can analyze the entire lifecycle of an item, thus obtaining feedback on the trend of the statistical indicators over time. In this way he/she can verify that the changes made to the items positively affected their quality. Figure 4b shows the chart of an item improved across two test sessions. The improvement is visible both in the increase of the item discrimination (the green line) and in the convergence of the calculated difficulty towards the tutor's estimation of the difficulty (the continuous and dashed red lines, respectively).

Figure 4. eWorkbook Interface

6. EXPERIMENTAL RESULTS

We evaluated the system by using it across two test sessions in a university course and measuring the overall improvement of the items in terms of discrimination capacity and matching of the tutor's desired difficulty. A database of 50 items was arranged for the experiment. In the first session, an on-line test containing a set of 25 randomly chosen items was administered to 60 students. Afterwards, the items were inspected through the system interface in order to identify those to substitute or modify. Once the substitutions and modifications were performed, the modified test was administered to 60 other students.

Figure 7a shows a table, exported to a spreadsheet, containing a report of the items presented in the first test session and their performance. The items to eliminate are highlighted in red, while those to modify are highlighted in yellow. According to the system analysis, 5 out of 25 items had to be discarded, while 4 of them had to be modified. Among the items to modify, for two of them (those with id 1-F-4 and 1-E-1) the difficulty was underestimated due to a distractor that was too plausible (class 7), whose text was opportunely modified. In another case (1-B-16), the difficulty was different from that estimated by the tutor due to the intrinsic difficulty of the item (class 8); the action undertaken was to adjust the tutor's estimation of the difficulty. Lastly, the item with id 1-F-1, with a negative discrimination, presented a suspected error in the choice of the key (class 9).

To give the reader a more precise idea, two modified items (opportunely translated from Italian) are reported in Figure 5 and Figure 6. In item 1-F-4 (Figure 5a), a distractor (option D) was found to be "too plausible". Since the distractor was chosen by too many learners (26 out of 60 = 0.43), the item was much more difficult (difficulty = 0.79) than expected (medium = 0.5). The tutor modified the distractor by changing its text from "Refreshes the content of the page http://www.expedia.it/info.htm in 20 seconds" to "Refreshes 20 times the content of the page http://www.expedia.it/info.htm" (Figure 5b). Such a modification significantly decreased the distractor's plausibility, thus obtaining, in the second session, a difficulty level (0.43) for the item closer to the desired one.

Figure 5. The versions of item 1-F-4 used for the first (a) and the second (b) test session.

By inspecting item 1-F-1 (Figure 6a), the tutor verified that the chosen key was not correct, even though the distractor labeled as correct by the system was not the right answer either: simply, the item did not have any correct answer. The text of the key was modified to provide the right answer to the stem (Figure 6b).

Figure 6. The versions of item 1-F-1 used for the first (a) and the second (b) test session.
A new test was prepared, containing the same items except for the 5 discarded ones, which were substituted by 5 unused items, and for the 4 modified ones, which were substituted by their newer versions. A new set of sixty students participated in this test. In the analysis of the test outcomes, our attention was focused more on the improvement obtained than on the discovery of new defective items.

Figure 7. Results after test sessions

Figure 7b shows the report of the second test session. To measure the overall improvement of the new test compared to the previous one, the following parameters were calculated for each of the two tests:

• the average discrimination of the items;
• the average of the differences |tutor_difficulty - difficulty| for the items of the test.

As for parameter 1, we observed an improvement from a value of 0.375, obtained in the first session, to a value of 0.466, obtained in the second session. As for parameter 2, we had a decrease in the mean difference between the difficulty estimated by the tutor and the one calculated by the system, passing from a value of 0.19 to 0.157 across the two sessions. It is worth noting that, in our experiment, the tests were administered to learners enrolled in the same university course, even if across different exam sessions, so the two samples can be considered comparable and the results valid. Due to the dependency of IA results on the learners' ability, there is no guarantee that the system behaves in the expected way when the context radically changes between different sessions.

7. RELATED WORK

Several applications supporting e-testing, including the most common Web-based e-learning platforms such as Moodle, Blackboard, and Questionmark, evaluate item quality by generating and showing item statistics. Nevertheless, in most cases, the interpretation of these statistics is left to the tutors: these systems do not advise or help the tutor in improving items. Several commercial stand-alone applications are available for improving test quality through IA (Integrity, 2008; Berk & Griesemer, 1976; Lertap, 2008) or IRT (RASCAL, 2008; Gierl & Ackerman, 1996). These can import test data from e-testing systems through a text file. Some of them are Web-based applications, such as Integrity, which can perform a detailed test analysis that also identifies problem areas and includes relevant recommendations for addressing them. Unlike our system, however, parameters are not combined in rules: a recommendation is given when an anomalous value is found for a given parameter. Some other systems run under specific platforms (operating systems or spreadsheets). A program running under MS Windows is ITEMAN (Berk & Griesemer, 1976). ITEMAN analyzes data files (in ASCII format) of test item responses, produced by optical mark readers (scanners) or by manual data entry, to compute conventional item analysis statistics. ITEMAN offers a multiple-keying option that allows items to have more than one correct answer (e.g., for a poorly-written item), and will flag those answers which appear to function better than the keyed answer. Our system does something similar by firing rule 9. An application running in a spreadsheet is Lertap, an Excel-based classical item and test analysis program. A nice feature of this program is the so-called Visual Item Analysis, which suggests an ocular approach to item analysis and exemplifies some of the graphics produced by Lertap.
A model for presenting test statistics and analysis, and for collecting students' learning behaviors in order to generate analysis results and feedback for tutors, is described in (Hsieh et al., 2003). In other approaches, the qualitative characteristics of the items are considered for different aims: IRT has been applied in some systems (Ho & Yen, 2005) and experiments (Chen et al., 2004; Sun, 2000) to select the most appropriate items for examinees based on individual ability. In (Chen et al., 2004), fuzzy set theory is combined with the original IRT to model uncertainty in learning responses. The result of this combination is called Fuzzy Item Response Theory. Winters & Payne (2005), mining the data of their educational institution, found some scores that could be analyzed with the purpose of identifying those items that were particularly good or particularly bad, giving instructors feedback that will hopefully train them to ask better questions more consistently.

A work closely related to ours is presented in (Hung et al., 2004). It proposes an e-testing system where rules can detect defective items, which are signaled using traffic lights, and an analysis model based on IA. Statistics are calculated by the system both on the items and on the whole test. Unfortunately, the four rules on which the system is based seem insufficient to cover all of the possible defects which can affect an item. Moreover, these rules are not inferred from consolidated statistical models and use crisp values (e.g., one of them states that an option must be discarded if its frequency is 0, independently of the size of the sample). Furthermore, the paper does not report any experiment demonstrating the effectiveness of the system in improving assessment. Nevertheless, this work has given us many ideas, and our work can be considered a continuation of it. To elaborate, our system improves on the above cited one in the following aspects:

• it broadens and improves the rules used to check the items;
• it gives advice to tutors on how to improve item quality;
• it manages rule uncertainty (using fuzzy logic);
• it has been evaluated in an experiment.

Lastly, most of the scientific literature about e-testing and structured tests focuses on item generation with automatic (Mitkov & Ha, 2003; Brown et al., 2005) or semi-automatic (Wang et al., 2007; Hoshino & Nakagawa, 2007; Chen et al., 2006) processes based on Natural Language Processing (NLP) techniques, performed on instructional documents in electronic format. The automatic systems generate the items, while the semi-automatic ones assist the user in their generation. In general, human intervention is anyhow necessary to verify the soundness of the items before using them in a test. Only in a few cases is the quality of the generated items verified through statistical models such as IA or IRT; in most cases the items are inspected and possibly modified by the tutor. The evaluation of the whole system is performed by checking the percentage of reliable items out of the number of generated ones. In conclusion, we believe that tools that automatically generate items or assist the tutor in their creation, such as those described above, can be very useful, since they reduce the time spent in the onerous item construction phase. Nevertheless, they are still far from offering optimal performance and many of the analyzed systems are tailored to a specific educational subject, mostly foreign language teaching.
Our approach is more general and can be applied to any subject. Furthermore, many tutors will keep on using their own items, and our system is still applicable to generated items for further improving their quality. Our system, compared to automatic or semi-automatic ones, requires a longer time for item construction, but allows better quality items to be obtained with respect to the following aspects:

• a better discrimination capacity;
• evaluation of the learners on the knowledge desired by the tutor;
• a difficulty level closer to the one desired by the tutor.

8. CONCLUSION

In this paper we have presented a rule-based system capable of improving item quality. Our system's knowledge-base is mostly taken from several statistical models for item evaluation and partly extracted from data. The system detects anomalies in the items and gives tutors advice for their improvement. Obviously, the system can only detect defects which are visible by analyzing the results of item and distractor analysis. The strength of our system lies in the possibility for all tutors, and not only experts in assessment or statistics, to improve test quality by discarding or, when possible, modifying defective items. An initial experiment carried out at the University of Salerno has produced encouraging results, showing that the system can effectively help tutors to obtain items which better discriminate between skilled and untrained students and better match the difficulty estimated by the tutor. More accurate experiments, involving a larger set of items and students, are necessary to better measure the system's capabilities. Our system performs a classification of items, carried out by evaluating fuzzy rules. At present, we are collecting data on test outcomes. Even though fuzzy classification has proven to perform well, we also intend to investigate other classification methods, such as decision trees and Bayesian classifiers, once a large database of items and learners' answers becomes available.

REFERENCES

Bardossy A., Duckstein L. (1995). Fuzzy Rule-Based Modeling with Applications to Geophysical, Biological, and Engineering Systems. CRC Press, Boca Raton, USA.
Berk R. A., Griesemer H. A. (1976). Software Review: ITEMAN: An Item Analysis Program for Tests, Questionnaires, and Scales. Educational and Psychological Measurement, 36 (1) (pp. 189-191).
Blackboard (2008). Available at http://www.blackboard.com.
Brachman R.J., Khabaza T., Kloesgen W., Piatetsky-Shapiro G., Simoudis E. (1996). Mining Business Databases. Communications of the ACM, 39 (11) (pp. 42-48).
Bradley R.S., Barry R.G., Kiladis G. (1982). Climatic fluctuations of the western United States during the period of instrumental records. Final report to the National Science Foundation. University of Massachusetts, Amherst.
Brown J.C., Frishkoff G.A., Eskenazi M. (2005). Automatic Question Generation for Vocabulary Assessment. Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (pp. 819-826).
Chen C.M., Duh L.J., Liu C.Y. (2004). A Personalized Courseware Recommendation System Based on Fuzzy Item Response Theory. Proceedings of the IEEE International Conference on e-Technology, e-Commerce and e-Service, Taipei, Taiwan (pp. 305-308).
Chen C.Y., Liou H.C., Chang J.S. (2006). FAST: an Automatic Generation System for Grammar Tests. Proceedings of the COLING/ACL on Interactive Presentation Sessions (pp. 1-4).
Civanlar M.R., Trussel H.J. (1986). Constructing membership functions using statistical data. Fuzzy Sets and Systems, 18 (pp. 1-14).
Costagliola G., Ferrucci F., Fuccella V., Oliveto R. (2007). eWorkbook: a Computer Aided Assessment System. International Journal of Distance Education Technology, 5 (3) (pp. 24-41).
ECMAScript (2008). Standard ECMA-262, ECMAScript Language Specification. Available at http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf.
Exarchos T. P., Tsipouras M. G., Exarchos C. P., Papaloukas C., Fotiadis D. I., Michalis L. K. (2007). A methodology for the automated creation of fuzzy expert systems for ischaemic and arrhythmic beat classification based on a set of rules obtained by a decision tree. Artificial Intelligence in Medicine, 40 (3) (pp. 187-200).
FCL (1997). Fuzzy Control Programming, Committee Draft CD 1.0 (Rel. 19 Jan 97). Available at http://www.fuzzytech.com/binaries/ieccd1.pdf.
Gierl M. J., Ackerman T. (1996). Software Review: XCALIBRE - Marginal Maximum-Likelihood Estimation Program, Windows Version 1.10. Applied Psychological Measurement, 20 (3) (pp. 303-307).
Hambleton R. K., Swaminathan H. (1985). Item Response Theory - Principles and Applications. Netherlands: Kluwer Academic Publishers Group.
Hibernate (2008). Available at http://www.hibernate.org.
Ho R.G., Yen Y.C. (2005). Design and Evaluation of an XML-Based Platform-Independent Computerized Adaptive Testing System. IEEE Transactions on Education, 48 (2) (pp. 230-237).
Hoshino A., Nakagawa H. (2007). A Cloze Test Authoring System and Its Automation. Proceedings of the 6th International Conference on Web-Based Learning (pp. 174-181).
Hsieh C.T., Shih T.K., Chang W.C., Ko W.C. (2003). Feedback and Analysis from Assessment Metadata in E-learning. Proceedings of the 17th International Conference on Advanced Information Networking and Applications, Xi'an, China (pp. 155-158).
Hung J.C., Lin L.J., Chang W.C., Shih T.K., Hsu H.H., Chang H.B., Chang H.P., Huang K.H. (2004). A Cognition Assessment Authoring System for E-Learning. Proceedings of the 24th International Conference on Distributed Computing Systems Workshops (pp. 262-267).
IA (2008). Item Analysis. Available at http://www.washington.edu/oea/pdfs/resources/item_analysis.pdf.
Integrity (2008). Integrity - Item analysis and collusion detection tools. Available at http://integrity.castlerockresearch.com/
jFuzzyLogic (2008). Open Source Fuzzy Logic (Java). Available at http://jfuzzylogic.sourceforge.net/html/index.html.
Lertap (2008). Lertap 5! Available at http://www.lertap.curtin.edu.au/
Mamdani E.H., Assilian S. (1999). An Experiment in Linguistic Synthesis with a Fuzzy Logic Controller. International Journal of Human-Computer Studies, 51 (2) (pp. 135-147).
Massey (2007). The Relationship Between the Popularity of Questions and Their Difficulty Level in Examinations Which Allow a Choice of Question. Occasional Publication of the Test Development and Research Unit, Cambridge.
MEAP (2007). State of Michigan, Department of Education. Design and Validity of the MEAP Test. Available at http://www.michigan.gov/mde/0,1607,7-140-22709_31168-94522--,00.html.
Mitkov R., Ha L.A. (2003). Computer-Aided Generation of Multiple-Choice Tests. Proceedings of the HLT-NAACL 03 Workshop on Building Educational Applications Using Natural Language Processing - Volume 2 (pp. 17-22).
Momoh J., Srinivasan D., Tomsovic K., Baer D. (2000). Chapter 5: Expert Systems Applications, in K. Tomsovic, M.Y. Chow (eds.), Tutorial on Fuzzy Logic Applications in Power Systems.
Moodle (2008). Available at http://moodle.org.
Questionmark (2008). Available at http://www.questionmark.com.
RASCAL (2008). RASCAL - Rasch Analysis Program. Available at http://www.assess.com/xcart/product.php?productid=253&cat=29&page=1
Roiger R.J., Geatz M.W. (2004). Introduzione al Data Mining (in Italian). McGraw-Hill.
Stage C. (1999). A Comparison Between Item Analysis Based on Item Response Theory and Classical Test Theory. A Study of the SweSAT Subtest READ. Available at http://www.umu.se/edmeas/publikationer/pdf/enr3098sec.pdf.
Struts (2008). The Apache Struts Web Application Framework. Available at http://struts.apache.org
Sun K.T. (2000). An Effective Item Selection Method for Educational Measurement. Proceedings of the International Workshop on Advanced Learning Technologies (pp. 105-106).
Wang W., Hao T., Liu W. (2007). Automatic Question Generation for Learning Evaluation in Medicine. Proceedings of the 6th International Conference on Web-Based Learning (pp. 198-203).
Winters T., Payne T. (2005). What Do Students Know? An Outcomes-Based Assessment System. Proceedings of the 2005 International Workshop on Computing Education Research (pp. 165-172).
Woodford K., Bancroft P. (2005). Multiple Choice Items Not Considered Harmful. Proceedings of the 7th Australian Conference on Computing Education (pp. 109-116).
Yu C. H. (2005). A Simple Guide to the Item Response Theory (IRT). Available at http://seamonkey.ed.asu.edu/~alex/computer/sas/IRT.pdf
Yu C. H., Wong J. W. (2003). Using SAS for classical item analysis and option analysis. Proceedings of the 2003 Western Users of SAS Software Conference. Available at http://www.lexjansen.com/wuss/2003/DataAnalysis/c-using_sas_for_classical_item_analysis.pdf
Zadeh L. A. (1977). Fuzzy Sets and Their Applications to Pattern Classification and Clustering. World Scientific Publishing Co. Inc., River Edge, NJ, USA.
Zhou J., Li Q., Xu D., Chen Y., Xiao T. (2005). Fuzzy Rule-based Integrated System Multi-indicators Economic Performance Evaluation and Decision Making Support Framework. Proceedings of the International Conference on Computational Intelligence for Modelling, Control and Automation / International Conference on Intelligent Agents, Web Technologies and Internet Commerce (pp. 714-720).